Abstract

A boolean expression is in read-once form if each of its variables appears exactly once. 
When the variables denote independent events in a probability space, the probability of the 
event denoted by the whole expression in read-once form can be computed in polynomial time 
(whereas the general problem for arbitrary expressions is #P-complete). Known approaches to 
checking read-once property seem to require putting these expressions in disjunctive normal 
form. In this paper, we tell a better story for a large subclass of boolean event expressions: 
those that are generated by conjunctive queries without self-joins and on tuple-independent 
probabilistic databases. 

We first show that given a tuple-independent representation and the provenance graph of an 
SPJ query plan without self-joins, we can, without using the DNF of a result event expression, 
efficiently compute its co-occurrence graph. From this, the read-once form can already, if it 
exists, be computed efficiently using existing techniques. Our second and key contribution is 
a complete, efficient, and simple to implement algorithm for computing the read-once forms 
(whenever they exist) directly, using a new concept, that of co-table graph, which can be 
significantly smaller than the co-occurrence graph. 



* A shorter version of this paper will appear in the proceedings of ICDT 2011.
Figure 1: (a) A tuple-independent probabilistic database, (b) Event table representation, (c) An
unsafe query.



1 Introduction 

The computation of distributions for query answers on probabilistic databases is closely related 
to the manipulation of boolean formulas. This connection has led both to interesting theoretical 
questions and to implementation opportunities. In this paper we consider the tuple-independent 
model¹ for probabilistic databases. In such a model probabilistic databases are represented by
tables whose tuples t are each annotated by a probability value $p_t > 0$; see Fig. 1(a). Each tuple
appears in a possible world (instance) of the representation with probability $p_t$ independently of
the other tuples. This defines a probability distribution on all possible instances. 

Manipulating all possible instances is impossibly unwieldy, so techniques have been developed
[27, 11, 36] for obtaining the query answers from the much smaller representation tables. This
is where boolean formulas make their entrance. The idea, by now well-understood (H El QUELL * s 
to define the relational algebra operators on tables whose tuples are annotated with event expres- 
sions. The event expressions are boolean expressions whose variables annotate the tuples in the 
input tables. The computation of event expressions is the same as that used in c-tables [22J as mod- 
els for incomplete and probabilistic databases are closely related IfTBI . Once the event expressions 
are computed for the tuples in the representation table of the query answer (which is in general not 
tuple-independent), probabilities are computed according to the standard laws. 

The event expressions method was called intensional "semantics" by Fuhr and Rölleke, and
they observed that with this method computing the query answer probabilities seems to require
exponentially many steps in general [11]. Indeed, the data complexity² of query evaluation on
probabilistic databases is #P-complete, even for conjunctive queries [13], in fact even for quite



1 This model has been considered as early as [4] as well as in, e.g., [11], [36], [13], [8].

2 Here, and throughout the paper, the data input consists of the representation tables [13, 8] rather than the collection
of possible worlds. 
simple boolean queries [8] such as the query in Fig. 1(c).

But Fuhr and Rölleke also observe that certain event independences can be taken advantage
of, when present, to compute answer probabilities in PTIME, with a procedure called extensional 
"semantics". The idea behind the extensional approach is the starting point for the remarkable 
results flU El of Dalvi and Suciu who discovered that the conjunctive queries can be decidably and 
elegantly separated into those whose data complexity is #P-complete and those for whom a safe 
plan taking the extensional approach can be found. 

Our starting point is the observation that even when the data complexity of a query is
#P-complete (i.e., the query is unsafe [8]), there may be classes of data inputs for which the
computation can be done with the extensional approach, and is therefore in PTIME. We illustrate with a
simple example.

Example 1. Consider the tuple-independent probabilistic database and the conjunctive query Q
in Fig. 1. Since the query Q is boolean it has just one possible answer, and the event expression
annotating it is³

$$f = w_1 v_1 u_1 + w_2 v_2 u_1 + w_3 v_3 u_2 + w_3 v_4 u_3 \qquad (1)$$

This was obtained with the standard plan $\pi_{()}((R \bowtie S) \bowtie T)$. However, it is equivalent to another
boolean expression

$$(w_1 v_1 + w_2 v_2)\, u_1 + w_3 (v_3 u_2 + v_4 u_3) \qquad (2)$$

which has the property that each variable occurs exactly once.

Event (boolean) expressions in which each variable occurs exactly once are in read-once form
(see [28]). For read-once forms the events denoted by non-overlapping subexpressions are jointly
independent, so we can use the key idea of the extensional approach:

Fact If events $E_1, \ldots, E_n$ are jointly independent then

$$P(E_1 \cap \cdots \cap E_n) = P(E_1) \cdots P(E_n) \qquad (3)$$
$$P(E_1 \cup \cdots \cup E_n) = 1 - [1 - P(E_1)] \cdots [1 - P(E_n)] \qquad (4)$$

Example 2 (Example 1 continued). The probability of the answer (2) can be computed as follows:

$$P(f) = P((w_1 v_1 + w_2 v_2)\, u_1 + w_3 (v_3 u_2 + v_4 u_3)) = 1 - [1 - P(w_1 v_1 + w_2 v_2) P(u_1)][1 - P(w_3) P(v_3 u_2 + v_4 u_3)]$$

where

$$P(w_1 v_1 + w_2 v_2) = 1 - [1 - P(w_1) P(v_1)][1 - P(w_2) P(v_2)]$$

and

$$P(v_3 u_2 + v_4 u_3) = 1 - [1 - P(v_3) P(u_2)][1 - P(v_4) P(u_3)].$$

We can extend this example to an entire class of representation tables of unbounded size. For each
n, the relations R, T will have 3n tuples while S will have 4n tuples, and the probability of the
answer can be computed in time O(n); see the appendix.

It is also clear that there is no relational algebra plan that directly yields (2) above.
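The computation in Example 2 can be sketched in a few lines of Python. This is only an illustration: the tuple probabilities below are hypothetical (the values in Fig. 1(a) are not reproduced here), and the nested-tuple encoding of expressions and the function names are our own assumptions. The sketch applies rules (3) and (4) recursively to the read-once form (2), and checks the result against a brute-force sum over all possible worlds that satisfy expression (1).

```python
from itertools import product

# Hypothetical tuple probabilities (assumption, not taken from Fig. 1(a)).
p = {"w1": 0.3, "w2": 0.4, "w3": 0.6,
     "v1": 0.5, "v2": 0.8, "v3": 0.2, "v4": 0.7,
     "u1": 0.9, "u2": 0.1, "u3": 0.5}

def prob(expr):
    """Probability of a read-once expression encoded as a nested tuple.

    A leaf is a variable name; an inner node is ("and", e1, ..., en) or
    ("or", e1, ..., en). Sub-expressions share no variables, so they denote
    jointly independent events and rules (3) and (4) apply.
    """
    if isinstance(expr, str):
        return p[expr]
    op, *args = expr
    if op == "and":                       # rule (3)
        result = 1.0
        for a in args:
            result *= prob(a)
        return result
    miss = 1.0                            # rule (4)
    for a in args:
        miss *= 1.0 - prob(a)
    return 1.0 - miss

# Read-once form (2): (w1 v1 + w2 v2) u1 + w3 (v3 u2 + v4 u3)
read_once = ("or",
             ("and", ("or", ("and", "w1", "v1"), ("and", "w2", "v2")), "u1"),
             ("and", "w3", ("or", ("and", "v3", "u2"), ("and", "v4", "u3"))))

def brute_force():
    """Sum the weights of all possible worlds that satisfy expression (1)."""
    names = sorted(p)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        w = dict(zip(names, bits))
        f = ((w["w1"] and w["v1"] and w["u1"]) or
             (w["w2"] and w["v2"] and w["u1"]) or
             (w["w3"] and w["v3"] and w["u2"]) or
             (w["w3"] and w["v4"] and w["u3"]))
        if f:
            weight = 1.0
            for x in names:
                weight *= p[x] if w[x] else 1.0 - p[x]
            total += weight
    return total
```

The recursive evaluation visits each variable once, whereas the brute-force check enumerates all 2^10 worlds; the two agree up to floating-point error.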

3 To reduce the size of expressions and following established tradition we use + for ∨ and · for ∧, and we even
omit the latter in most terms.
In fact, Fuhr and Rölleke (see [11], Thm. 4.5) state that probabilities can be computed by
"simple evaluation" (i.e., by the extensional method) if and only if the event expressions computed
intensionally are in read-once form. Moreover, the safe plans of [8] are such that all the event
expressions computed with the intensional method, both on intermediary relations and on the final
answer, are in read-once form.

But more can be done. Notice that the expression (1) is not in read-once form but it is equivalent
to (2), which is. Boolean expressions that are equivalent to read-once forms have been called
by various names, eg., separable, fanout-free EOl . repetition-free [fTTll . /i-expressions [|35l . non- 
repeating 11321 . but since the late 80's EH the terminology seems to have converged on read-once. 
Of course, not all boolean expressions are read-once, eg., xy + yz + zx or xy + yz + zu are not. 

With this motivation we take the study of the following problem as the goal of this paper: 

Problem Given tuple-independent database I and boolean conjunctive query Q, when is Q(I)
read-once and if so, can its read-once form be computed efficiently?

It turns out that [12] gives a fast algorithm that takes a formula in irredundant disjunctive normal
form, decides whether it is read-once, and if it is, computes the read-once form (which is in fact
unique modulo associativity and commutativity). The algorithm is based upon a characterization
in terms of the formula's co-occurrence graph given in [18].

Some terminology (taken up again in Section 2). Since we don't have anything to say about
negation or difference in queries we work only with monotone boolean formulas (all literals are
positive, only disjunction and conjunction operations). Disjunctive normal forms (DNFs) are
disjunctions of implicants, which in turn are conjunctions of distinct variables. A prime implicant of
a formula E is one with a minimal set of variables among all that can appear in DNFs equivalent to 
E. By absorption, we can retain only the prime implicants. The result is called an irredundant DNF 
(IDNF) of E, and is unique modulo associativity and commutativity. The co-occurrence graph of 
a boolean formula E has its variables as nodes and has an edge between x and y iff they both occur 
in the same prime implicant of E. 

For positive relational queries, the size of the IDNF of the boolean event expressions is
polynomial in the size of the table, but often (and necessarily) exponential in the size of the query.
This is a good reason for avoiding the explicit computation of the IDNFs, and in particular for not
relying on the algorithm in [12]. In recent and independent work Sen et al. [33] proved that for
the boolean expressions that arise out of the evaluation of conjunctive queries without self-joins
the characterization in [18] can be simplified and one only needs to test whether the co-occurrence
graph is a "cograph", which can be done in linear time.⁴

It is also stated in [33] that even for conjunctive queries without self-joins computing co-occurrence
graphs likely requires obtaining the IDNF of the boolean expressions. One of our contributions in
this paper is to show that an excursion through the IDNF is in fact not necessary because the
co-occurrence graphs can be computed directly from the provenance graph [25, 14] that captures the
computation of the query on a table. Provenance graphs are DAG representations of the event
expressions in such a way that most common subexpressions for the entire table (rather than just 

4 Defining cographs seems unnecessary for this paper. It suffices to point out that most cograph recognition
algorithms produce (if it exists) something called a "cotree" (sigh) which in the case of co-occurrence graphs associated to
boolean formulas is exactly a read-once form!
each tuple) are not replicated. The smaller size of the provenance graphs likely provides practical
speedups in the computations (compared for example with the provenance trees of [33]).
Moreover, our approach may be applicable to other kinds of queries, as long as their provenance graphs
satisfy a simple criterion that we identify.

To give more context to our results, we also note that Hellerstein and Karpinski [21] have shown
that if $RP \neq NP$ then deciding whether an arbitrary monotone boolean formula is read-once cannot
be done in PTIME in the size of the formula.

The restriction to conjunctive queries without self-joins further allows us to contribute
improvements even over an approach that composes our efficient computation of co-occurrence graphs with
one of the linear-time algorithms for cograph recognition [6, 19, 3]. Indeed, we show that only a
certain subgraph of the co-occurrence graph (we call it the co-table graph) is relevant for our stated
problem. The co-table graph can be asymptotically smaller than the co-occurrence graph for some
classes of queries and instances. To enable the use of only part of the co-occurrence graph we
contribute a novel algorithm that computes (when they exist) the read-once forms, using two new
ideas: row decomposition and table decomposition. Using just connectivity tests (e.g., DFS), our
algorithm is simpler to implement than the cograph recognition algorithms in [6, 19, 3] and it has
the potential of affecting the implementation of probabilistic databases.

Moreover, the proof of completeness for our algorithm does not use the cograph
characterization on which [33] relies. As such, the algorithm itself provides an alternative new characterization
of read-once expressions generated by conjunctive queries without self-joins. This may provide 
useful insights into extending the approach to handle larger classes of queries. 

Having rejected the use of co-occurrence graphs, Sen et al. [33] provide a different approach
that derives efficiently the read-once form directly from the computations of the trees underlying
the boolean expressions, so called "lineage trees", by merging read-once forms that correspond to
partial formulas. They provide a complexity analysis only for one of the steps that their algorithm
applies repeatedly. However, to the best of our understanding of the asymptotic complexity of their
algorithm, it appears that our algorithm is faster at least by a multiplicative factor of $k^2$ where k is
the number of tables, and the benefit can often be more.

It is also important to note that neither the results of this paper, nor those of [33] provide
complexity dichotomies as does, e.g., [8]. It is easy to give a family of probabilistic databases for
which the query in Fig. 1(c) generates event expressions of the following form:

$$x_1 x_2 + x_2 x_3 + \cdots + x_{n-1} x_n + x_n x_{n+1}.$$

These formulas are not read-once, but with a simple memoization (dynamic programming)
technique we can compute their probability in time linear in n (see Appendix B).
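One way such a linear-time computation can go is sketched below. This is our own illustration, under the assumption that the variables are independent with known probabilities; it is not necessarily the technique used in the paper's appendix. The formula is false exactly when no two consecutive variables are both true, so it suffices to track, for each prefix, the probability that no such pair has occurred, split by the truth value of the last variable.

```python
from itertools import product

def chain_probability(ps):
    """P(x1 x2 + x2 x3 + ... + xn x(n+1)) for independent xi with P(xi) = ps[i-1].

    ok_false / ok_true: probability that the prefix seen so far contains no
    two consecutive true variables and that its last variable is false / true.
    """
    ok_false, ok_true = 1.0 - ps[0], ps[0]
    for q in ps[1:]:
        # The next variable may be false after anything, but true only after a false.
        ok_false, ok_true = (ok_false + ok_true) * (1.0 - q), ok_false * q
    return 1.0 - (ok_false + ok_true)

def chain_brute_force(ps):
    """Exponential-time check: enumerate all truth assignments."""
    total = 0.0
    for bits in product([False, True], repeat=len(ps)):
        if any(bits[i] and bits[i + 1] for i in range(len(ps) - 1)):
            weight = 1.0
            for b, q in zip(bits, ps):
                weight *= q if b else 1.0 - q
            total += weight
    return total
```

The loop performs a constant amount of work per variable, so the running time is linear in n, whereas the brute-force check enumerates 2^(n+1) assignments.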



Roadmap. In Section 2 we review definitions, explain how to compute provenance DAGs for
SPJ queries, and compare the sizes of the co-occurrence and co-table graphs. Section 3 presents
a characterization of the co-occurrence graphs that correspond to boolean expressions generated
by conjunctive queries without self-joins. The characterization uses the provenance DAG. With
this characterization we give an efficient algorithm for computing the co-table (and co-occurrence)
graph. In Section 4 we give an efficient algorithm that, using the co-table graph, checks if the
result of the query is read-once, and if so computes its read-once form. Putting together these
two algorithms we obtain an efficient query-answering algorithm that is complete for boolean
conjunctive queries without self-joins and tuple-independent databases that yield read-once event
expressions. In Section 5 we compare the time complexity of this algorithm with that of Sen
et al. [33], and other approaches that take advantage of past work in the read-once and cograph
literature. Related work, conclusions and ideas for further work ensue.



2 Preliminaries 

A tuple-independent probabilistic database is represented by a usual (set-)relational database
instance I in which, additionally, every tuple is annotated with a probability in (0,1]; see for
example Fig. 1(a). We call this the probability table representation. We will denote by
$\mathbf{R} = \{R_1, \ldots, R_k\}$ the relational schema of the representation. By including/excluding each tuple independently
with the probability given by its annotation, the representation defines a set of $\mathbf{R}$-instances called possible
worlds and the obvious probability distribution on this set, hence a discrete probability space. For
a given tuple $t \in R_i$ this space's event "t occurs" (the set of possible worlds in which t occurs)
has probability exactly the annotation of t in the representation. Following the intensional
approach [27, 11, 36] we also consider the event table representation, which consists of the same
tables, but in which every tuple is annotated by its unique tuple id; for example, see Fig. 1(b).



The tuple ids play three distinct but related roles: (1) they identify tuples uniquely over all 
tables and in fact we will often call the tuple ids just tuples, (2) they are boolean variables, (3) they 
denote the events "the tuple occurs" in the probability space of all possible worlds. The last two 
perspectives can be combined by saying that the tuple ids are boolean-valued random variables
over said probability space. Moreover, an event expression is a boolean expression with the tuple 
ids as variables. 

The intensional approach further defines the semantics of the relational algebra operators on
event tables, i.e., relational instances in which the tuples are annotated with event expressions. In
this paper we will only need monotone boolean expressions because our queries only use selection,
projection and join, and these operators do not introduce negation. Otherwise, joins produce
conjunctions, projections produce disjunctions, and selections erase the non-compliant tuples. It
is worth observing that the relational algebra on event tables is essentially a particular case of
the algebra on c-tables [22], and precisely a particular case of the relational algebra on
semiring-annotated relations [15].

Since by now they are well understood (see the many papers we cited so far), we do not repeat 
here the definition of select, project, and join on tables annotated with boolean event expressions 
but instead we explain how they produce provenance graphs. The concept that we define here is 
a small variation on the provenance graphs defined in [25, 14] where conjunctive queries (part of
mapping specifications) are treated as a black box. It is important for the provenance graphs used 
in this paper to reflect the structure of different SPJ query plans that compute the same conjunctive 
query. 

A provenance graph (PG) is a directed acyclic graph (DAG) H such that the nodes V(H) of
H are labeled by variables or by the operation symbols · and +. As we show below, each node
corresponds to a tuple in an event table that represents the set of possible worlds of either the
input database or some intermediate database computed by the query plan. An edge $u \to v$ is in
E(H) if the tuple corresponding to u is computed using the tuple corresponding to v in either a
join (in which case u is labeled with ·) or a projection (in which case u is labeled with +). The
nodes with no outgoing edges are those labeled with variables and are called leaves, while the
nodes with no incoming edges are called roots (and can be labeled with either operation symbol).
Provenance graphs (PGs) are closely related to the lineage trees of [33]. In fact, the lineage trees
are tree representations of the boolean event expressions, while PGs are more economical: they
represent the same expressions but without the multiplicity of common subexpressions. Thus, they
are associated with an entire table rather than with each tuple separately, each root of the graph
corresponding to a tuple in the table.⁵



Figure 2: Provenance graph for $R \bowtie S$.

Figure 3: Provenance graph for $\pi_{()}((R \bowtie S) \bowtie T)$.



We explain how the SPJ algebra works on tables with PGs. If tables $R_1$ and $R_2$ have PGs $H_1$
and $H_2$ then the PG for $R_1 \bowtie R_2$ is constructed as follows. Take the disjoint union H of $H_1$ and $H_2$.
For every $t_1 \in R_1$ and $t_2 \in R_2$ that do join, add a new root labeled with · and make the root of $H_1$
corresponding to $t_1$ and that of $H_2$ corresponding to $t_2$ children of this new root. Afterwards, delete
(recursively) any remaining roots from $H_1$ and $H_2$. For example, referring again to Fig. 1, the PG
associated with the table computed by $R \bowtie S$ is shown in Fig. 2.

For selection, delete (recursively) the roots that correspond to the tuples that do not satisfy the
selection predicate. For projection, consider a table T with PG H and X a subset of its attributes.
The PG for $\pi_X T$ is constructed as follows. For each $t \in \pi_X T$, let $t_1, \ldots, t_m$ be all the tuples in T
that X-project to t. Add to H a new root labeled with + and make the roots in H corresponding to
$t_1, \ldots, t_m$ the children of this new root. Referring again to Fig. 1, the PG associated with the result
of the query plan $\pi_{()}((R \bowtie S) \bowtie T)$ is shown in Fig. 3. Since the query is boolean, this PG has just
one root.

5 Note that to facilitate the comparison with the lineage trees the edge direction here is the opposite of the direction
in [25, 14].
The boolean event expressions that annotate tuples in the event tables built in the intensional
approach can be read off the provenance graphs. Indeed, if t occurs in an (initial, intermediate,
or final) table T whose PG is H, then, starting at the root u of H corresponding to t, traverse the
subgraph induced by all the nodes reachable from u and build the boolean expression recursively
using parent labels as operation symbols and the subexpressions corresponding to the children as
operands. For example, we read
$((w_1 \cdot v_1) \cdot u_1) + (u_1 \cdot (w_2 \cdot v_2)) + (u_2 \cdot (v_3 \cdot w_3)) + ((w_3 \cdot v_4) \cdot u_3)$
off the PG in Fig. 3.
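The reading-off procedure can be illustrated with a small sketch. The dictionary encoding of the PG, the internal node names (`root`, `a1`, `b1`, ...) and the use of `*` for the · label are our own assumptions; the structure is reconstructed from the expression read off Fig. 3, since the figure itself is not reproduced here.

```python
# Hypothetical encoding of the PG of Fig. 3: node -> (label, children).
# Nodes that are not keys of the dictionary are leaves (tuple variables).
pg = {
    "root": ("+", ["b1", "b2", "b3", "b4"]),
    "b1": ("*", ["a1", "u1"]), "b2": ("*", ["u1", "a2"]),
    "b3": ("*", ["u2", "a3"]), "b4": ("*", ["a4", "u3"]),
    "a1": ("*", ["w1", "v1"]), "a2": ("*", ["w2", "v2"]),
    "a3": ("*", ["v3", "w3"]), "a4": ("*", ["w3", "v4"]),
}

def read_off(node):
    """Build the expression at `node`, using parent labels as operators."""
    if node not in pg:                    # leaf: a tuple variable
        return node
    label, children = pg[node]
    return "(" + f" {label} ".join(read_off(c) for c in children) + ")"

print(read_off("root"))
```

Note that the leaves `u1` and `w3` each have two parents in this encoding: the DAG stores them once, while the expression string repeats them.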

The focus of this paper is the case when the boolean (event) expressions are read-once, i.e.,
they are equivalent to expressions in which every variable occurs exactly once, the latter said to
be in read-once form. For boolean expressions that are read-once, the read-once form is unique
(modulo associativity and commutativity).⁶ The interest in read-once formulas derives from the
fact that in tuple-independent databases the tuples in the input representation occur independently
in possible worlds. More complex boolean expressions denote events whose probability needs to
be computed from the probabilities of the variables, i.e., the probabilities of the independent "tuple
occurs" events in the input. When such event expressions are in read-once form their probability
can be computed efficiently, in time linear in the number of variables, using the rules (3) and (4) in
Section 1.

Given a tuple-independent probabilistic database and an SPJ query plan, hence the resulting 
provenance graph, our objective in this paper is to decide efficiently when the boolean expression(s) 
read off the PG are read-once, and when they are, to compute their read-once form(s) efficiently, 
hence the associated probability(ies).

In this paper we consider only boolean conjunctive queries. We can do this without loss of
generality because we can associate to a non-boolean conjunctive query Q and an instance I a set
of boolean queries in the usual manner: for each tuple t in the answer relation Q(I), consider the
boolean conjunctive query $Q_t$ which is obtained from Q by replacing the head variables with the
corresponding values in t. Note that the PGs that result from boolean queries have exactly one
root. We will also use Q(I) to denote the boolean expression generated by evaluating the query Q
on instance I, which may have different (but equivalent) forms based on the query plan.

Moreover, we consider only queries without self-joins. Therefore our queries have the form

$$Q() :- R_1(\mathbf{x}_1), \ldots, R_k(\mathbf{x}_k)$$

where $R_1, \ldots, R_k$ are all distinct table names while the $\mathbf{x}_i$'s are tuples of FO variables⁷ or constants,
possibly with repetitions, matching the arities of the tables. If the database has tables that do not
appear in the query, they are of no interest, so we will always assume that our queries feature all
the table names in the database schema $\mathbf{R}$.

As we have stated above, we only need to work with monotone boolean formulas (all literals 
are positive, only disjunction and conjunction operations). Every such formula is equivalent to 
(many) disjunctive normal forms (DNFs) which are disjunctions of conjunctions of variables. 

6 This seems to have been known for a long time. We could not find an explicitly stated theorem to this effect in the
literature, but, for example, it is clear that the result of the algorithm in [12] is uniquely determined by the input.

7 FO (first-order) is to emphasize the distinction between the variables in the query subgoals and the variables in 
the boolean expressions. 
These conjunctions are called implicants for the DNF. By idempotence we can take the variables
in an implicant to be distinct and the implicants of a DNF to be distinct from each other. A prime
implicant of a formula f is one with a minimal set of variables among all that can appear in DNFs
equivalent to f. By absorption, we can retain only the prime implicants in a DNF. The result is
called the irredundant DNF (IDNF) of f, as it is uniquely determined by f (modulo associativity
and commutativity). We usually denote it by $f_{IDNF}$. Note that in particular the set of prime
implicants is uniquely determined by f.
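For monotone DNFs, idempotence and absorption can be sketched directly on sets of variables. The representation of an implicant as a set of variable names and the function name `idnf` are assumptions made for this illustration; the quadratic scan is chosen for brevity, not efficiency.

```python
def idnf(implicants):
    """Irredundant DNF of a monotone DNF given as an iterable of variable sets.

    Idempotence: duplicate implicants collapse when stored in a set.
    Absorption: an implicant is dropped if another implicant is a strict
    subset of it (e.g. xy + xyz = xy).
    """
    terms = {frozenset(t) for t in implicants}
    return {t for t in terms if not any(s < t for s in terms)}

# xy + xyz + yx reduces to the single prime implicant xy.
print(idnf([{"x", "y"}, {"x", "y", "z"}, {"y", "x"}]))
```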

The co-occurrence graph, notation $G_{co}$, of a boolean formula f is an undirected graph whose
set of vertices $V(G_{co})$ is the set Var(f) of variables of f and whose set $E(G_{co})$ of edges is defined
as follows: there is an edge between x and y iff they both occur in the same prime implicant of f.
Therefore, $G_{co}$ is uniquely determined by f and it can be constructed from $f_{IDNF}$. This construction
is quadratic in the size of $f_{IDNF}$ but of course $f_{IDNF}$ can be exponentially larger than f. Fig. 4
shows the co-occurrence graph for the boolean expression f in equation (1) of Example 1. As this
figure shows, the co-occurrence graphs for expressions generated by conjunctive queries without
self-joins are always k-partite⁸ graphs on tuple variables from k different tables.
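The construction of the co-occurrence graph from the IDNF is short enough to sketch. The edge representation (sorted pairs of variable names) and the function name are our own choices; the prime implicants are those of expression (1).

```python
from itertools import combinations

def co_occurrence_graph(prime_implicants):
    """Edges {x, y} such that x and y occur together in some prime implicant.

    Each edge is stored as a sorted pair (x, y) with x < y.
    """
    edges = set()
    for term in prime_implicants:
        for x, y in combinations(sorted(term), 2):
            edges.add((x, y))
    return edges

# Prime implicants of f in equation (1).
f_idnf = [{"w1", "v1", "u1"}, {"w2", "v2", "u1"},
          {"w3", "v3", "u2"}, {"w3", "v4", "u3"}]
g_co = co_occurrence_graph(f_idnf)
print(len(g_co))  # number of edges
```

Every edge produced here joins variables from two different tables (w, v, and u variables), which matches the k-partite observation above with k = 3.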



We are interested in the co-occurrence graph $G_{co}$ of a boolean formula f because it plays a
crucial role in f being read-once. Indeed [18] has shown that a monotone f is read-once iff (1) it is
"normal" and (2) its $G_{co}$ is a "cograph". We don't need to discuss normality because [33] has shown
that for formulas that arise from conjunctive queries without self-joins it follows from the cograph
property. We will also avoid defining what a cograph is (see [5, 6]) except to note that cograph
recognition can be done in linear time [6, 19, 3] and that when applied to the co-occurrence graph
of f the recognition algorithms also produce, in effect, the read-once form of f, when it exists.

Although the co-occurrence graph of f is defined in terms of $f_{IDNF}$, we show in Section 3
that when f is the event expression produced by a boolean conjunctive query without self-joins
then we can efficiently compute the $G_{co}$ of f from the provenance graph H of any plan for the
query. Combining this with any of the cograph recognition algorithms we just cited, this yields
one algorithm for the goal of our paper, which we will call a cograph-help algorithm.

Because it uses the more general-purpose step of cograph recognition, a cograph-help algorithm
will not fully take advantage of the restriction to conjunctive queries without self-joins. Intuitively,

8 A graph $(V_1 \cup \cdots \cup V_k, E)$ is k-partite if for any edge $(u, v) \in E$ with $u \in V_i$ and $v \in V_j$, we have $i \neq j$.




Figure 4: $G_{co}$ for f in Example 1

Figure 5: $G_T$ for the relations in Example 1

Figure 6: $G_C$ for f in Example 1



with this restriction there may be lots of edges in $G_{co}$ that are irrelevant because they link tuples
that are not joined by the query. This leads us to the notion of co-table graph defined below. 

Toward the definition of the co-table graph we also need that of table-adjacency graph,
notation $G_T$. Given a boolean query without self-joins $Q() :- R_1(\mathbf{x}_1), \ldots, R_k(\mathbf{x}_k)$, the vertex set $V(G_T)$
is the set of k table names $R_1, \ldots, R_k$. We will say that $R_i$ and $R_j$ are adjacent iff $\mathbf{x}_i$ and $\mathbf{x}_j$ have
at least one FO variable in common, i.e., $R_i$ and $R_j$ are joined by the query. The set of edges
$E(G_T)$ consists of the pairs of adjacent table names. The table-adjacency graph $G_T$ for the query
in Example 1 is depicted in Fig. 5.
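A sketch of the table-adjacency graph construction follows. Two points are assumptions on our part: the encoding of a query as a mapping from table names to argument tuples, and the concrete shape of the query of Fig. 1(c), which we take to be Q() :- R(x), S(x, y), T(y) since the figure is not reproduced here.

```python
from itertools import combinations

def table_adjacency_graph(subgoals, constants=frozenset()):
    """Pairs of table names whose subgoals share at least one FO variable.

    subgoals:  dict mapping each table name to its tuple of argument symbols.
    constants: argument symbols to be treated as constants, not FO variables.
    """
    edges = set()
    for (r, xs), (s, ys) in combinations(sorted(subgoals.items()), 2):
        if (set(xs) & set(ys)) - set(constants):
            edges.add((r, s))
    return edges

# Assumed shape of the query in Fig. 1(c): Q() :- R(x), S(x, y), T(y).
query = {"R": ("x",), "S": ("x", "y"), "T": ("y",)}
print(sorted(table_adjacency_graph(query)))
```

Under this assumption R and T are not adjacent, because they share no FO variable.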

The table-adjacency graph $G_T$ helps us remove edges irrelevant to a query from the graph $G_{co}$.
For example, if there is an edge between $x \in R_i$ and $x' \in R_j$ in $G_{co}$, but there is no edge between
$R_i$ and $R_j$ in $G_T$, then (i) either there is no path connecting $R_i$ to $R_j$ in $G_T$ (so all tuples in $R_i$ pair
with all tuples in $R_j$), or (ii) x and x' are connected in $G_{co}$ via a set of tuples $x_1, \ldots, x_\ell$, such that
the tables containing these tuples are connected by a path in $G_T$. Our algorithm in Section 4 shows
that all such edges (x,x') can be safely deleted from $G_{co}$ for the evaluation of the query that yielded
$G_T$.

Definition 1. The co-table graph $G_C$ is the subgraph of $G_{co}$ with $V(G_C) = V(G_{co})$ and such that
given two tuples $x \in R_i$ and $x' \in R_j$ there is an edge $(x,x') \in E(G_C)$ iff $(x,x') \in E(G_{co})$ and $R_i$ and
$R_j$ are adjacent in $G_T$.

The co-table graph $G_C$ generated by the event tables and query in Fig. 1 is shown in Fig. 6 (it
is not a cograph!).
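Definition 1 translates into a simple filter on the edges of the co-occurrence graph. In the sketch below, the assignment of the w, v, and u variables to the tables R, S, and T is inferred from the query plan, and the adjacency R-S, S-T is an assumption about the query of Fig. 1(c); the function and variable names are ours.

```python
from itertools import combinations

def co_table_graph(g_co, table_of, g_t):
    """Keep the edges of G_co whose endpoints lie in tables adjacent in G_T."""
    adjacent = {frozenset(e) for e in g_t}
    return {(x, y) for (x, y) in g_co
            if frozenset((table_of[x], table_of[y])) in adjacent}

# Co-occurrence graph of f in equation (1), built from its prime implicants.
f_idnf = [{"w1", "v1", "u1"}, {"w2", "v2", "u1"},
          {"w3", "v3", "u2"}, {"w3", "v4", "u3"}]
g_co = {pair for term in f_idnf for pair in combinations(sorted(term), 2)}

# Assumed table of each tuple variable and assumed table-adjacency graph.
table_of = {x: {"w": "R", "v": "S", "u": "T"}[x[0]] for term in f_idnf for x in term}
g_t = {("R", "S"), ("S", "T")}

g_c = co_table_graph(g_co, table_of, g_t)
print(len(g_co), len(g_c))  # edge counts of G_co and G_C
```

Under these assumptions the filter removes exactly the edges between w and u variables, i.e., those linking tuples of R and T.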

Co-occurrence graph vs. co-table graph. The advantage of using the co-table graph instead
of the co-occurrence graph is most dramatic in the following example:

Example 3. Consider $Q() :- R_1(\mathbf{x}_1), R_2(\mathbf{x}_2)$ where $\mathbf{x}_1$ and $\mathbf{x}_2$ have no common FO variable.
Assuming that each of the tables $R_1$ and $R_2$ has n tuples, $G_{co}$ has $n^2$ edges while $G_C$ has none. A
cograph-help algorithm must spend $\Omega(n^2)$ time even if it only reads $G_{co}$!
On the other hand, $G_C$ can be as big as $G_{co}$. In fact, when $G_T$ is a complete graph (see next
example), $G_C = G_{co}$.

Example 4. Consider $Q() :- R_1(\mathbf{x}_1, y), \ldots, R_k(\mathbf{x}_k, y)$ where $\mathbf{x}_i$ and $\mathbf{x}_j$ have no common FO variable
if $i \neq j$. Here $G_T$ is the complete graph on $R_1, \ldots, R_k$ and $G_C = G_{co}$.

However, it can be verified that both our algorithm and the cograph-help algorithm have the
same time complexity on the above example.

3 Computing the Co-Table Graph 

In this section we show that given as input the provenance DAG H of a boolean conjunctive query
plan without self-joins Q on a tuple-independent database representation I, the co-table graph $G_C$
and the co-occurrence graph $G_{co}$ of the boolean formula Q(I) (see definitions in Section 2) can be
computed in poly-time in the sizes of H, I and Q.

It turns out that $G_C$ and $G_{co}$ are computed by similar algorithms, one being a minor modification
of the other. As discussed in Section 1, the co-occurrence graph $G_{co}$ can then be used in conjunction
with cograph recognition algorithms (e.g., [6, 19, 3]) to find the read-once form of Q(I) if it exists.
On the other hand, the smaller co-table graph $G_C$ is used by our algorithm described in Section 4
for the same purpose.

We use Var(f) to denote the set of variables in a monotone boolean expression f. Recall that
the provenance DAG H is a layered graph where every layer corresponds to a select, project or
join operation in the query plan. We define the width of H as the maximum number of nodes at
any layer of the DAG H and denote it by $\beta_H$. The main result in this section is summarized by the
following theorem.

Theorem 1. Let f = Q(I) be the boolean expression computed by the query plan Q on the table
representation I (f can also be read off the provenance graph H of Q on I), let $n = |Var(f)|$ be the
number of variables in f, $m_H = |E(H)|$ the number of edges of H, $\beta_H$ the width of H, and
$m_{co} = |E(G_{co})|$ the number of edges of $G_{co}$, the co-occurrence graph of f.

1. $G_{co}$ can be computed in time $O(n m_H + \beta_H m_{co})$.

2. Further, the co-table graph $G_C$ of f can be computed in time $O(n m_H + \beta_H m_{co} + k^2 \alpha \log \alpha)$,
where k is the number of tables in Q, and $\alpha$ is the maximum arity (width) of the tables in Q.

3.1 LCA-Based Characterization of the Co-Occurrence Graph

Here we give a characterization of the presence of an edge (x,y) in $G_{co}$ based on the least common
ancestors of x and y in the graph H.

Again, let f = Q(I) be the boolean expression computed by the query plan Q on the table
representation I. As explained in Section 2, f can also be read off the provenance graph H of Q and
I since H is the representation of f without duplication of common subexpressions.

The absence of self -joins in Q implies the following. 
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Lemma 1. The DNF generated by expanding f (or H) using only the distributivity rule is in fact 
the IDNFoff up to idempotency (i.e. repetition of the same prime implicant is allowed). 

Proof. Let $g$ be the DNF generated from $f$ by applying distributivity repeatedly. Due to the absence
of self-joins, every implicant in $g$ has exactly one tuple from every table. Therefore, for any
two implicants in $g$, the set of variables in one is not a strict subset of the set of variables in the
other, and absorption (e.g., $xy + xyz = xy$) does not apply. (At worst, two implicants can be
the same, and the idempotence rule removes one of them.) Therefore, $g$ is also irredundant and hence
the IDNF of $f$ (up to commutativity and associativity). □

Denote by $f_{IDNF}$ the IDNF of $f$, which, as we have seen, can be computed from $f$ just by
applying distributivity.

As with any DAG, we can talk about the nodes of $H$ in terms of successors, predecessors,
ancestors, and descendants, and finally about the least common ancestors of two nodes, denoted
$\mathrm{lca}(x, y)$. Because $H$ has a root, $\mathrm{lca}(x, y)$ is never empty. When $H$ is a tree, $\mathrm{lca}(x, y)$ consists
of a single node. For a node $u \in V(H)$, we denote the set of leaf variables that are descendants
of $u$ by $\mathrm{Var}(u)$ (overloaded notation warning!); in other words, a variable $x$ belongs to $\mathrm{Var}(u)$,
$u \in V(H)$, if and only if $x$ is reachable from $u$ in $H$. Now we prove the key lemma of this section:

Lemma 2. Two variables $x, y \in \mathrm{Var}(f)$ belong together to a (prime) implicant of $f_{IDNF}$ if and
only if the set $\mathrm{lca}(x, y)$ contains a $\cdot$-node.

Proof. (if) Suppose $\mathrm{lca}(x, y)$ contains a $\cdot$-node $u$, i.e., $x$ and $y$ are descendants of two distinct
successors $v_1, v_2$ of $u$. Since the $\cdot$ operation multiplies all variables in $\mathrm{Var}(v_1)$ with all variables in
$\mathrm{Var}(v_2)$, $x$ and $y$ appear together in some implicant of $f_{IDNF}$, which by Lemma 1 is not absorbed
by other implicants.

(only if) Suppose that $x, y$ appear together in an implicant of $f_{IDNF}$ and $\mathrm{lca}(x, y)$ contains no
$\cdot$-node. Then no $\cdot$-node in $V(H)$ has $x \in \mathrm{Var}(v_1)$ and $y \in \mathrm{Var}(v_2)$, where $v_1, v_2$ are its two distinct
successors (note that any $\cdot$-node in a provenance DAG $H$ has exactly two successors). This
implies that $x$ and $y$ can never be multiplied, a contradiction. □

Since there are exactly $k$ tables in the query plan, every implicant in $f_{IDNF}$ is of size $k$.
Therefore:

Lemma 3. For every variable $x \in \mathrm{Var}(f)$ and $\cdot$-node $u \in V(H)$, if $x \in \mathrm{Var}(u)$, then $x \in \mathrm{Var}(v)$
for exactly one successor $v$ of $u$.

Proof. If $x \in \mathrm{Var}(f)$ belongs to both $\mathrm{Var}(v_1)$ and $\mathrm{Var}(v_2)$ for two distinct successors $v_1, v_2$ of $u$, then
some implicant in $f_{IDNF}$ has fewer than $k$ variables, since $x \cdot x = x$ by idempotence. □

The statement of Lemma 2 provides a criterion for computing $G_{co}$ using the computation of
least common ancestors in the provenance graph, which is often more efficient than computing
the entire IDNF. We have shown that this criterion holds for conjunctive queries
without self-joins. It may also hold for other kinds of queries, which opens a path to
identifying other cases in which such an approach would work.

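The criterion of Lemma 2 can be checked by brute force on a toy provenance DAG. The following Python sketch is illustrative only: the dictionary encoding of the DAG, the node names `a`, `b`, `c`, `root`, and all function names are assumptions made for this example, not notation from the paper. It expands the DAG into implicants by distributivity and compares the resulting co-occurrence pairs with the pairs separated by the two successors of some product node.

```python
from itertools import combinations

def implicants(dag, u):
    """Expand node u by distributivity into a set of implicants (frozensets of variables)."""
    if u not in dag:                      # leaf variable
        return {frozenset([u])}
    op, succ = dag[u]
    parts = [implicants(dag, v) for v in succ]
    if op == "+":
        return set().union(*parts)
    left, right = parts                   # a product node has exactly two successors
    return {a | b for a in left for b in right}

def leaves(dag, u):
    """Var(u): the leaf variables reachable from u."""
    if u not in dag:
        return {u}
    return set().union(*(leaves(dag, v) for v in dag[u][1]))

def pairs_from_dnf(dag, root):
    """Co-occurrence edges read off the expanded DNF (the expensive route)."""
    return {frozenset(p) for imp in implicants(dag, root) for p in combinations(imp, 2)}

def pairs_from_product_nodes(dag):
    """Co-occurrence edges read off the product nodes, without expanding the DNF."""
    edges = set()
    for op, succ in dag.values():
        if op == "*":
            v1, v2 = succ
            edges |= {frozenset((x, y)) for x in leaves(dag, v1) for y in leaves(dag, v2)}
    return edges

# Hypothetical DAG for (w1*v1 + w2*v2) * u1: internal node -> (operation, successors).
dag = {
    "a": ("*", ["w1", "v1"]),
    "b": ("*", ["w2", "v2"]),
    "c": ("+", ["a", "b"]),
    "root": ("*", ["c", "u1"]),
}
from_dnf = pairs_from_dnf(dag, "root")
from_products = pairs_from_product_nodes(dag)
```

On this DAG both routes yield the same six pairs, but only the first one materializes the implicants.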
3.2 Computing the Table-Adjacency Graph

It is easier to describe the computation of $G_T$ if we use the query in rule form
$Q() :- R_1(x_1), \ldots, R_k(x_k)$.

The rule form can be computed in linear time from the SPJ query plan. The vertex set
$V(G_T)$ is the set of table names $R_1, \ldots, R_k$, and an edge exists between $R_i$ and $R_j$ iff $x_i$ and $x_j$ have at
least one FO variable in common, i.e., $R_i$ and $R_j$ are joined. Whether or not such an edge should
be added can be decided in time $O(\alpha \log \alpha)$ by sorting and intersecting $x_i$ and $x_j$. Here $\alpha$ is the
maximum arity (width) of the tables $R_1, \ldots, R_k$. Hence $G_T$ can be computed in time $O(k^2 \alpha \log \alpha)$.
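
As a minimal sketch of this construction, assuming the rule form is given as a Python dictionary from table names to variable tuples (an encoding chosen for this example), the table-adjacency graph can be built as follows. The sketch intersects hash sets instead of sorting, which is a substitution for the sort-and-intersect step described above; the function name `table_adjacency` is hypothetical.

```python
from itertools import combinations

def table_adjacency(subgoals):
    """subgoals: table name -> tuple of FO variables of its subgoal.
    Returns the edge set of G_T as a set of frozensets {Ri, Rj}."""
    edges = set()
    for (ri, xi), (rj, xj) in combinations(subgoals.items(), 2):
        if set(xi) & set(xj):             # at least one FO variable in common
            edges.add(frozenset((ri, rj)))
    return edges

# Rule form Q() :- R(x), S(x, y), T(y)
g_t = table_adjacency({"R": ("x",), "S": ("x", "y"), "T": ("y",)})
```

For this query, R and S share x and S and T share y, so $G_T$ is the path R, S, T.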

3.3 Computing the Co-Table Graph

Recall that the co-table graph $G_C$ is a subgraph of the co-occurrence graph $G_{co}$ in which we add an edge
between two variables $x, y$ only if the tables containing these two tuples are adjacent in the
table-adjacency graph $G_T$. Algorithm 1 (CompCoTable) constructs the co-table graph $G_C$ by a single
bottom-up pass over the graph $H$.

Algorithm 1 Algorithm CompCoTable

Input: Query plan DAG $H$ and table-adjacency graph $G_T$
Output: Co-table graph $G_C$

 1: Initialize $V(G_C) = \mathrm{Var}(f)$, $E(G_C) = \emptyset$.
 2: For all variables $x \in \mathrm{Var}(f)$, set $\mathrm{Var}(x) = \{x\}$.
 3: Do a topological sort on $H$ and reverse the sorted order.
 4: for every node $u \in V(H)$ in this order do
 5:     /* Update the Var(u) set for both +-nodes and $\cdot$-nodes u */
 6:     Set $\mathrm{Var}(u) = \bigcup_v \mathrm{Var}(v)$, where the union is over all successors $v$ of $u$.
 7:     if $u \in V(H)$ is a $\cdot$-node then
 8:         /* Add edges to $G_C$ only for a $\cdot$-node */
 9:         Let $v_1, v_2$ be its two successors.
10:         for every two variables $x \in \mathrm{Var}(v_1)$ and $y \in \mathrm{Var}(v_2)$ do
11:             if (i) the tables containing $x, y$ are adjacent in $G_T$ and (ii) the edge $(x, y)$ does not
                exist in $E(G_C)$ yet then
12:                 Add an edge between $x$ and $y$ in $E(G_C)$.
13:             end if
14:         end for
15:     end if
16: end for



It is easy to see that a minor modification of the same algorithm can be used to compute the
co-occurrence graph $G_{co}$: in Step 11 we simply skip the check whether the tables containing the
two tuples are adjacent in $G_T$. Since this is the only place where $G_T$ is used, the time for the
computation of $G_{co}$ does not include the time related to computing or checking $G_T$.

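The bottom-up pass of CompCoTable can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the DAG encoding (internal node mapped to an operation and its successors), the `table_of` map, and the example DAG are hypothetical, and a hash set of edges stands in for the explicit "edge does not exist yet" test of Step 11. Passing `g_t=None` skips the adjacency check and therefore yields $G_{co}$.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def comp_co_table(dag, table_of, g_t=None):
    """dag: internal node -> (op, [successors]) with op '+' or '*'.
    table_of: leaf variable -> table name.
    g_t: set of frozenset table pairs; None skips the adjacency check (gives G_co)."""
    var = {x: {x} for x in table_of}                   # Step 2: Var(x) = {x}
    edges = set()
    deps = {u: succ for u, (_, succ) in dag.items()}
    # static_order() emits every node after its listed successors (reverse topological order).
    for u in TopologicalSorter(deps).static_order():
        if u not in dag:
            continue                                   # leaf, already initialized
        op, succ = dag[u]
        var[u] = set().union(*(var[v] for v in succ))  # Step 6
        if op == "*":                                  # Steps 7-14
            v1, v2 = succ
            for x in var[v1]:
                for y in var[v2]:
                    if g_t is None or frozenset((table_of[x], table_of[y])) in g_t:
                        edges.add(frozenset((x, y)))
    return edges

# Hypothetical DAG for (w1*v1 + w2*v2) * u1 with tuples of R, S and T.
dag = {
    "a": ("*", ["w1", "v1"]),
    "b": ("*", ["w2", "v2"]),
    "c": ("+", ["a", "b"]),
    "root": ("*", ["c", "u1"]),
}
table_of = {"w1": "R", "w2": "R", "v1": "S", "v2": "S", "u1": "T"}
g_t = {frozenset(("R", "S")), frozenset(("S", "T"))}

g_c = comp_co_table(dag, table_of, g_t)    # co-table graph
g_co = comp_co_table(dag, table_of)        # co-occurrence graph
```

Here $G_{co}$ has six edges, while $G_C$ drops the two R-T pairs `(w1, u1)` and `(w2, u1)` because R and T are not adjacent in $G_T$.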
Correctness. By a simple induction, it can be shown that the set $\mathrm{Var}(u)$ is correctly computed
at every step, i.e., it contains the set of all variables that are reachable from $u$ in $H$ (since the nodes
are processed in reverse topological order and $\mathrm{Var}(u)$ is the union of $\mathrm{Var}(v)$ over all successors
$v$ of $u$). The next lemma shows that algorithm CompCoTable correctly builds the co-table graph $G_C$
(proof in Appendix C).

Lemma 4. Algorithm CompCoTable adds an edge $(x, y)$ to $G_C$ if and only if $x, y$ appear together
in some implicant of $f_{IDNF}$ and the tables containing $x, y$ are adjacent in $G_T$.



Time Complexity. Here we give a sketch of the time complexity analysis; details can be found
in the appendix (Section C.1). Computation of the table-adjacency graph takes $O(k^2 \alpha \log \alpha)$ time,
as shown in Section 3.2. The total time complexity of algorithm CompCoTable as given in
Theorem 1 is mainly due to two operations: (i) computing the $\mathrm{Var}(u)$ set at every internal
node $u \in V(H)$, and (ii) testing, for pairs $x, y$ at two distinct children of a $\cdot$-node, whether
the edge $(x, y)$ already exists in $G_C$, and if not, adding the edge.

We show that the total time needed for the first operation is $O(n m_H)$: for every internal
node $u \in V(H)$ we can scan the variable sets of all its immediate successors in $O(n d_u)$ time to
compute $\mathrm{Var}(u)$, where $d_u$ is the outdegree of node $u$ in $H$. Summing over all nodes gives $O(n m_H)$
time in total. Adding edges $(x, y)$ to $G_C$, on the other hand, takes $O(m_{co} \beta_H)$ time in total during
the execution of the algorithm, where $m_{co}$ is the number of edges in the co-occurrence graph (and not
in the co-table graph, even if we compute the co-table graph $G_C$) and $\beta_H$ is the width of the graph $H$.
To show this, we show that two variables $x, y$ are considered by the algorithm at Step 10 if and only if
the edge $(x, y)$ exists in the co-occurrence graph $G_{co}$; however, the edge may not be added to
the co-table graph $G_C$ if the corresponding tables are not adjacent in the table-adjacency graph $G_T$.
We also show that any such edge $(x, y)$ is considered at a unique level of the DAG $H$. In
addition to these operations, the algorithm does initialization and a topological sort on the vertices,
which take $O(m_H + n_H)$ time ($n_H = |V(H)|$) and are dominated by these two operations.



4 Computing the Read-Once Form

Our algorithm CompRO (for Compute Read-Once) takes as input an instance $I$ of the schema
$R = R_1, \ldots, R_k$ and a query $Q() :- R_1(x_1), R_2(x_2), \ldots, R_k(x_k)$, along with the table-adjacency
graph $G_T$ and the co-table graph $G_C$ computed in the previous section, and outputs whether $Q(I)$ is
read-once (if so, it computes its unique read-once form).

Theorem 2. Suppose we are given a query $Q$, a tuple-independent database representation $I$, the
co-table graph $G_C$ and the table-adjacency graph $G_T$ for $Q$ on $I$ as inputs. Then

1. Algorithm CompRO decides correctly whether the expression generated by evaluating $Q$ on
$I$ is read-once, and if yes, it returns the unique read-once form of the expression, and

2. Algorithm CompRO runs in time $O(m_T \alpha \log \alpha + (m_C + n) \min(k, \sqrt{n}))$,

where $m_T = |E(G_T)|$ is the number of edges in $G_T$, $m_C = |E(G_C)|$ is the number of edges in $G_C$,
$n$ is the total number of tuples in $I$, $k$ is the number of tables, and $\alpha$ is the maximum size of any
subgoal.



4.1 Algorithm CompRO

In addition to the probabilistic database with tables $R_1, \ldots, R_k$ and the input query $Q$, our algorithm
also takes the table-adjacency graph $G_T$ and the co-table graph $G_C$ computed in the first phase, as
discussed in Section 3. The co-table graph $G_C$ also helps us remove from all the tables the unused
tuples, which do not appear in the final expression: an unused tuple has no corresponding
node in $G_C$. So from now on we can assume w.l.o.g. that every tuple in every table appears in the
final expression $f$.

The algorithm CompRO uses two decomposition operations. Row decomposition is a horizontal
decomposition operation that partitions the rows (tuples) in every table into the same
number of groups and forms a set of sub-tables from every table. Table decomposition, on the other
hand, is a vertical decomposition operation. It partitions the set of tables into groups, and a
modified sub-query is evaluated in every group. For convenience, we represent the instance
$I$ as $R_1[T_1], \ldots, R_k[T_k]$, where $T_i$ is the set of tuples in table $R_i$. Similarly, for a subset of tuples
$T_i' \subseteq T_i$, $R_i[T_i']$ denotes the instance of relation $R_i$ containing exactly the tuples in $T_i'$. The
algorithm CompRO is given in Algorithm 2.

Row Decomposition. The row decomposition operation partitions the tuple variables in every
table into $\ell$ disjoint groups. In addition, it decomposes the co-table graph $G_C$ into $\ell \ge 2$ disjoint
induced subgraphs$^{9}$ corresponding to the above groups. For every pair of distinct groups $j, j'$
and every pair of distinct tables $R_i, R_{i'}$, no tuple in group $j$ of $R_i$ ever joins with a tuple in
group $j'$ of $R_{i'}$ (recall that the query does not have any self-join operation). The procedure for row
decomposition is given in Algorithm 3.$^{10}$

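Row decomposition amounts to a connected-components computation on the co-table graph. The sketch below assumes a dictionary from table names to sets of tuple variables and an edge set of frozensets; these encodings, the function name, and the edge set of the example (inferred from the running example discussed later in this section) are assumptions of this illustration.

```python
def row_decomposition(tuples_by_table, g_c):
    """Split every table along the connected components of the co-table graph.
    Returns one dict (table -> tuples) per component, or None if G_C is connected."""
    adj = {x: set() for ts in tuples_by_table.values() for x in ts}
    for e in g_c:
        x, y = tuple(e)
        adj[x].add(y)
        adj[y].add(x)
    seen, components = set(), []
    for start in sorted(adj):              # sorted only to make the output order stable
        if start in seen:
            continue
        comp, stack = set(), [start]       # iterative DFS
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(adj[x] - comp)
        seen |= comp
        components.append(comp)
    if len(components) < 2:
        return None                        # row decomposition is not possible
    return [{t: ts & comp for t, ts in tuples_by_table.items()} for comp in components]

tables = {"R": {"w1", "w2", "w3"}, "S": {"v1", "v2", "v3", "v4"}, "T": {"u1", "u2", "u3"}}
g_c = {frozenset(p) for p in [("w1", "v1"), ("w2", "v2"), ("v1", "u1"), ("v2", "u1"),
                              ("w3", "v3"), ("w3", "v4"), ("v3", "u2"), ("v4", "u3")]}
groups = row_decomposition(tables, g_c)
```

With this input the ten tuples fall into two groups, one containing `u1` and one containing `u2` and `u3`.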
Table Decomposition. The table decomposition operation, on the other hand, partitions the
set of tables $R = R_1, \ldots, R_k$ into $\ell \ge 2$ disjoint groups $\mathbf{R}_1, \ldots, \mathbf{R}_\ell$. It also decomposes the
table-adjacency graph $G_T$ and the co-table graph $G_C$ into $\ell$ disjoint induced subgraphs $G_{T,1}, \ldots, G_{T,\ell}$
and $G_{C,1}, \ldots, G_{C,\ell}$, respectively, corresponding to the above groups. The groups are selected in such a
way that all tuples in the tables in one group join with all tuples in the tables in another group. This
procedure also modifies the sub-query to be evaluated on every group: it makes the sub-queries
of different groups mutually independent by introducing free variables, i.e., they do not share
any common variables after a successful table decomposition. Algorithm 4 describes the table
decomposition operation. Since the table decomposition procedure changes the input query $Q$ to
$Q = Q_1, \ldots, Q_\ell$, it is crucial to ensure that changing the query to be evaluated does not change the
answer to the final expression. This is shown in Lemma 11 in Appendix D.
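
The graph part of table decomposition can be sketched as follows: each edge of the table-adjacency graph is marked "+" when all tuple pairs across its two tables are present in the co-table graph and "-" otherwise, and the tables are then grouped by connectivity over "-" edges only. The sketch deliberately omits the rewriting of sub-queries with fresh free variables; the data encodings and the function name are assumptions of this illustration.

```python
def table_decomposition(tuples_by_table, g_t, g_c):
    """Group tables by connectivity over '-' edges of G_T; None if only one group results."""
    minus = set()
    for e in g_t:
        ri, rj = tuple(e)
        complete = all(frozenset((x, y)) in g_c
                       for x in tuples_by_table[ri] for y in tuples_by_table[rj])
        if not complete:
            minus.add(e)                   # some tuple pair is missing: mark the edge '-'
    # Merge the groups of the two endpoints of every '-' edge.
    group_of = {t: {t} for t in tuples_by_table}
    for e in minus:
        ri, rj = tuple(e)
        merged = group_of[ri] | group_of[rj]
        for t in merged:
            group_of[t] = merged
    groups = {frozenset(g) for g in group_of.values()}
    return None if len(groups) < 2 else groups

# Hypothetical sub-instance: two R tuples, two S tuples, one T tuple.
tables = {"R": {"w1", "w2"}, "S": {"v1", "v2"}, "T": {"u1"}}
g_t = {frozenset(("R", "S")), frozenset(("S", "T"))}
g_c = {frozenset(p) for p in [("w1", "v1"), ("w2", "v2"), ("v1", "u1"), ("v2", "u1")]}
groups = table_decomposition(tables, g_t, g_c)
```

Here the edge (R, S) is marked "-" because `(w1, v2)` is missing, while (S, T) is marked "+", so the tables split into {R, S} and {T}.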



9 A subgraph $H$ of $G$ is an induced subgraph if, for any two vertices $u, v \in V(H)$, $(u, v) \in E(G)$ implies $(u, v) \in E(H)$.

10 It should be noted that the row decomposition procedure may be called on $R_{i_1}[T'_{i_1}], \ldots, R_{i_p}[T'_{i_p}]$ and $G'_C$, where
$R_{i_1}, \ldots, R_{i_p}$ is a subset of the relations from $R_1, \ldots, R_k$, $T'_{i_1}, \ldots, T'_{i_p}$ are subsets of the respective sets of tuples
$T_{i_1}, \ldots, T_{i_p}$, and $G'_C$ is the induced subgraph of $G_C$ on $T'_{i_1}, \ldots, T'_{i_p}$. For simplicity of notation, we use
$R_1[T_1], \ldots, R_k[T_k]$. This holds for table decomposition as well.

Algorithm 2 CompRO($Q$, $I = (R_1[T_1], \ldots, R_k[T_k])$, $G_C$, $G_T$, Flag)

Input: Query $Q$, tables $R_1[T_1], \ldots, R_k[T_k]$, co-table graph $G_C$, table-adjacency graph $G_T$, and
       a boolean parameter Flag which is true if and only if row decomposition is performed at
       the current step.
Output: If successful, the unique read-once form $f^*$ of the expression for $Q(I)$

 1: if $k = 1$ then
 2:     return $\sum_{x \in T_1} x$ with success. /* all unused tuples are already removed */
 3: end if
 4: if Flag = True then /* Row decomposition */
 5:     Perform RD($(R_1[T_1], \ldots, R_k[T_k])$, $G_C$).
 6:     if row decomposition returns with success then /* RD partitions every table and $G_C$ into
        $\ell \ge 2$ disjoint groups */
 7:         Let the groups returned be $((T_1^j, \ldots, T_k^j), G_{C,j})$, $j \in [1, \ell]$.
 8:         $\forall j \in [1, \ell]$, let $f_j$ = CompRO($Q$, $(R_1[T_1^j], \ldots, R_k[T_k^j])$, $G_{C,j}$, $G_T$, False).
 9:         return $f^* = f_1 + \cdots + f_\ell$ with success.
10:     end if
11: else /* Table decomposition */
12:     Perform TD($(R_1[T_1], \ldots, R_k[T_k])$, $Q$, $G_T$, $G_C$).
13:     if table decomposition returns with success then /* TD partitions $I$, $G_C$ and $G_T$ into $\ell \ge 2$
        disjoint groups, $\sum_{j=1}^{\ell} k_j = k$ */
14:         Let the groups returned be $((R_{j,1}, \ldots, R_{j,k_j}), Q_j, G_{C,j}, G_{T,j})$, $j \in [1, \ell]$.
15:         $\forall j \in [1, \ell]$, $f_j$ = CompRO($Q_j$, $(R_{j,1}[T_{j,1}], \ldots, R_{j,k_j}[T_{j,k_j}])$, $G_{C,j}$, $G_{T,j}$, True).
16:         return $f^* = f_1 \cdot \cdots \cdot f_\ell$ with success.
17:     end if
18: end if
19: if the current operation is not successful then /* Current row or table decomposition is not
    successful and $k > 1$ */
20:     return with failure: "$Q(I)$ is not read-once".
21: end if



The following lemma shows that if row decomposition is successful, then table decomposition
cannot be successful, and vice versa. However, both of them may be unsuccessful if the final
expression is not read-once. The proof of the lemma is in Appendix D.

Lemma 5. At any step of the recursion, if row decomposition is successful then table decomposition
is unsuccessful, and vice versa.

Therefore, at the top-most level of the recursive procedure, we can verify which operation
can be performed; if both of them fail, then the final expression is not read-once, which follows
from the correctness of our algorithm. If the top-most recursive call performs a successful row
decomposition, the algorithm CompRO is initially called as CompRO($Q$, $(R_1[T_1], \ldots, R_k[T_k])$, $G_C$,
$G_T$, True). The last boolean argument is True if and only if row decomposition is performed at
the current level of the recursion tree. If table decomposition is successful in the first step, then the
value of the last boolean argument in the initial call is False.

Algorithm 3 RD($(R_1[T_1], \ldots, R_k[T_k])$, $G'_C$)

Input: Tables $R_1[T_1], \ldots, R_k[T_k]$, and induced subgraph $G'_C$ of $G_C$ on $\bigcup_{i=1}^{k} T_i$
Output: If successful, the partition of $G'_C$ and of the tuple variables of every input table into $\ell \ge 2$
        connected components: $((T_{1,j}, \ldots, T_{k,j}), G'_{C,j})$, $j \in [1, \ell]$

1: Run BFS or DFS to find the connected components in $G'_C$.
2: Let $\ell$ be the number of connected components.
3: if $\ell = 1$ then /* there is only one connected component */
4:     return with failure: "Row decomposition is not possible".
5: else
6:     Let the tuples (vertices) of table $R_i$ in the $j$-th connected component of $G'_C$ be $T_{i,j}$.
7:     Let the induced subgraph for connected component $j$ be $G'_{C,j}$.
8:     return $((T_{1,1}, \ldots, T_{k,1}), G'_{C,1}), \ldots, ((T_{1,\ell}, \ldots, T_{k,\ell}), G'_{C,\ell})$ with success.
9: end if

Correctness. The following two lemmas show, respectively, the soundness and completeness of
the algorithm CompRO (proofs are in Appendix D).

Lemma 6. (Soundness) If the algorithm returns with success, then the expression $f^*$ returned by
the algorithm CompRO is equivalent to the expression $Q(I)$ generated by evaluating query $Q$
on instance $I$. Further, the output expression $f^*$ is in read-once form.

Lemma 7. (Completeness) If the expression $Q(I)$ is read-once, then the algorithm CompRO
returns the unique read-once form $f^*$ of the expression.

For completeness, it suffices to show that if $Q(I)$ is read-once, then the algorithm does not exit
with an error. Indeed, if the algorithm returns with success, then, as shown in the soundness lemma,
it returns an expression $f^*$ in read-once form, which is the unique read-once form of $Q(I)$.



Time Complexity. Consider the recursion tree of the algorithm CompRO. Lemma 5 shows that
at any level of the recursion tree, either all recursive calls use the row decomposition procedure
or all recursive calls use the table decomposition procedure. The time complexity of CompRO
given in Theorem 2 is analyzed in the following steps. If $n'$ is the total number of input tuples at
the current recursive call and $m'_C$ is the number of edges in the induced subgraph $G'_C$ on these
$n'$ vertices, we show that row decomposition takes $O(m'_C + n')$ time and that, not considering the time
needed to compute the modified queries $Q_j$ (Step 13 in Algorithm 4), the table decomposition
procedure takes $O(m'_C + n')$ time. Then we consider the time needed to compute the modified
queries and show that these steps take $O(m_T \alpha \log \alpha)$ time in total over all recursive calls of the
algorithm, where $\alpha$ is the maximum size of a subgoal in the query $Q$. Finally, we give a bound of
$O(\min(k, \sqrt{n}))$ on the height of the recursion tree of the algorithm CompRO. Note, however, that
at every step, for row or table decomposition, every tuple in $G'_C$ goes to exactly one of the recursive
calls, and every edge in $G'_C$ goes to at most one of the recursive calls. So for both row and table
decomposition, the total time at every level of the recursion tree is $O(m_C + n)$. Combining all these
observations, the total time complexity of the algorithm is $O(m_T \alpha \log \alpha + (m_C + n) \min(k, \sqrt{n}))$, as
stated in Theorem 2. The details can be found in Appendix D.1.

Algorithm 4 TD($(R_1[T_1], \ldots, R_k[T_k])$, $Q() :- R_1(x_1), \ldots, R_k(x_k)$, $G'_C$, $G'_T$)

Input: Tables $R_1[T_1], \ldots, R_k[T_k]$, query $Q() :- R_1(x_1), \ldots, R_k(x_k)$, induced subgraph $G'_T$ of $G_T$
       on $\bigcup_{i=1}^{k} R_i$, induced subgraph $G'_C$ of $G_C$ on $\bigcup_{i=1}^{k} T_i$
Output: If successful, a partition of the input tables, $G'_T$ and $G'_C$ into $\ell$ groups, and an updated
        sub-query for every group

 1: for all edges $e = (R_i, R_j)$ in $G'_T$ do
 2:     Annotate the edge $e$ with the common variables $C_e$ in the vectors $x_i$, $x_j$.
 3:     Mark the edge $e$ with a "+" if for every pair of tuple variables $x \in T_i$ and $y \in T_j$ the edge
        $(x, y)$ exists in $G'_C$. Otherwise mark the edge with a "$-$".
 4: end for
 5: Run BFS or DFS to find the connected components in $G'_T$ w.r.t. "$-$" edges.
 6: Let $\ell$ be the number of connected components.
 7: if $\ell = 1$ then /* there is only one connected component */
 8:     return with "Failure: Table decomposition is not possible".
 9: else
10:     Let $G'_{T,1}, \ldots, G'_{T,\ell}$ be the induced subgraphs of the $\ell$ connected components of $G'_T$ and
        $G'_{C,1}, \ldots, G'_{C,\ell}$ be the corresponding induced subgraphs of $G'_C$.
11:     Let $\mathbf{R}_p = (R_{p,1}, \ldots, R_{p,k_p})$ be the subset of tables in the $p$-th component of $G'_T$, $p \in [1, \ell]$.
12:     /* Compute a new query for every component */
13:     for every component $p$ do
14:         for every table $R_i$ in this component $p$ do
15:             Let $C_i = \bigcup_e C_e$ be the union of the common variables $C_e$ over all edges $e$ from $R_i$ to tables
                in different components of $G'_T$ (all such edges are marked with "+").
16:             For every common variable $z \in C_i$, generate a new (free) variable $z'$, and replace all
                occurrences of $z$ in vector $x_i$ by $z'$. Let $x'_i$ be the new vector.
17:             Change the query subgoal for $R_i$ from $R_i(x_i)$ to $R_i(x'_i)$.
18:         end for
19:         Let $Q_p() :- R_{p,1}(x'_{p,1}), \ldots, R_{p,k_p}(x'_{p,k_p})$ be the new query for component $p$.
20:     end for
21:     return $((R_{1,1}, \ldots, R_{1,k_1}), Q_1, G'_{C,1}, G'_{T,1}), \ldots, ((R_{\ell,1}, \ldots, R_{\ell,k_\ell}), Q_\ell, G'_{C,\ell}, G'_{T,\ell})$ with success.
22: end if

Example. Here we illustrate our algorithm. Consider the query $Q$ and instance $I$ from
Example 1 in the introduction. The input query is $Q() :- R(x) S(x, y) T(y)$. In the first phase, the
table-adjacency graph $G_T$ and the co-table graph $G_C$ are computed. These graphs are depicted in
Fig. 5 and Fig. 6, respectively.

Figure 7: $G_{C,1}$ and $G_{C,2}$.

Now we apply CompRO. There is a successful row decomposition at the top-most recursive
call that decomposes $G_C$ into the two subgraphs $G_{C,1}, G_{C,2}$ shown in Fig. 7. So the final expression
$f^*$ annotating the answer $Q(I)$ is the sum of the expressions $f_1, f_2$ annotating the answers of
$Q$ applied to the relations corresponding to $G_{C,1}$ and $G_{C,2}$, respectively.

The relations corresponding to $G_{C,1}$ are $R$ with tuples $w_1, w_2$, $S$ with tuples $v_1, v_2$, and $T$ with
tuple $u_1$, and the relations corresponding to $G_{C,2}$ are $R$ with tuple $w_3$, $S$ with tuples $v_3, v_4$, and $T$
with tuples $u_2, u_3$ (listing tuple variables only).

Now we focus on the first recursive call at the second level of the recursion tree, with input co-table
subgraph $G_{C,1}$. Note that the table-adjacency graph for this call is the same as $G_T$. At this level the
table decomposition procedure is invoked and the edges of the table-adjacency graph are marked
with + and $-$ signs, see Fig. 8. In this figure the common variable set for $R, S$ on the edge $(R, S)$ is
$\{x\}$, and for $S, T$ on the edge $(S, T)$ is $\{y\}$. Further, the edge $(S, T)$ is marked with a "+" because
all possible edges exist between the tuples in $S$ (in this case $v_1, v_2$) and the tuples in $T$
(in this case $u_1$). However, the tuples in $R$ (here $w_1, w_2$) and the tuples in $S$ (here $v_1, v_2$) do not have all
possible edges between them, so the edge $(R, S)$ is marked with a "$-$".

The table decomposition procedure performs a connected component decomposition using "$-$"-edges,
which decomposes $G_T$ into two components, $\{R, S\}$ and $\{T\}$. The subset $C$ of common variables
collected from the "+"-edges across different components consists of the variables on the single edge
$(S, T)$, $C = \{y\}$. This variable $y$ is replaced by new free variables in all subgoals containing it,
which are $S$ and $T$ in our case. So the modified queries for the disjoint components returned by the
table decomposition procedure are $Q_1() :- R(x) S(x, y_1)$ and $Q_2() :- T(y_2)$. The input graph $G_{C,1}$
is decomposed further into $G_{C,1,1}$ and $G_{C,1,2}$, where $G_{C,1,1}$ has the edges $(w_1, v_1)$ and $(w_2, v_2)$,
whereas $G_{C,1,2}$ has no edges and a single vertex $u_1$. Moreover, the expression $f_1$ is the product
of $f_{11}$ and $f_{12}$ generated by these two queries, respectively. Since the number of tables for $Q_2$ is only
one, and $T$ has a single tuple, by the base step (Step 2) of CompRO, $f_{12} = u_1$. For the expression $f_{11}$
from $Q_1$, the graph $G_{C,1,1}$ can now be decomposed using a row decomposition into two connected
components with a single edge each ($(w_1, v_1)$ and $(w_2, v_2)$, respectively). There will be recursive
subcalls on these two components, and each of them will perform a table decomposition (there is one
tuple in every table, so the single edges in both calls will be marked with "+"). Hence $f_{11}$
evaluates to $f_{11} = w_1 v_1 + w_2 v_2$. So $f_1 = f_{11} \cdot f_{12} = (w_1 v_1 + w_2 v_2) u_1$.

R --($-$, x)-- S --(+, y)-- T

Figure 8: Marked table-adjacency graph for R, S, T.



By a similar analysis, it can be shown that the same query $Q$ evaluated on the relations
$R, S, T$ corresponding to $G_{C,2}$ gives $f_2 = w_3 (v_3 u_2 + v_4 u_3)$. So the overall algorithm is successful
and outputs the read-once form $f^* = f_1 + f_2 = (w_1 v_1 + w_2 v_2) u_1 + w_3 (v_3 u_2 + v_4 u_3)$.
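
The recursion traced in this example can be replayed with a simplified Python sketch of CompRO that works on the graphs only. Several things here are assumptions of the illustration rather than the authors' implementation: the function names, the string rendering of expressions, the data encodings, and the edge set of $G_C$, which is inferred from the text of the example because the figures are not reproduced. The sketch also omits the sub-query rewriting of Algorithm 4 and assumes that unused tuples have already been removed.

```python
def components(nodes, edges):
    """Connected components of an undirected graph, ordered by their smallest node."""
    adj = {v: set() for v in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for s in sorted(nodes):
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def comp_ro(tables, g_c, g_t, row_step):
    """tables: table -> set of tuple variables. Returns a read-once string or None."""
    if len(tables) == 1:                               # base step
        (ts,) = tables.values()
        return " + ".join(sorted(ts))
    parts = []
    if row_step:                                       # row decomposition
        comps = components(set().union(*tables.values()), g_c)
        if len(comps) < 2:
            return None
        for c in comps:
            sub = {t: ts & c for t, ts in tables.items()}
            parts.append(comp_ro(sub, {e for e in g_c if e <= c}, g_t, False))
    else:                                              # table decomposition
        minus = {e for e in g_t
                 if not all(frozenset((x, y)) in g_c
                            for x in tables[min(e)] for y in tables[max(e)])}
        comps = components(set(tables), minus)
        if len(comps) < 2:
            return None
        for c in comps:
            sub = {t: tables[t] for t in c}
            nodes = set().union(*sub.values())
            parts.append(comp_ro(sub, {e for e in g_c if e <= nodes},
                                 {e for e in g_t if e <= c}, True))
    if any(p is None for p in parts):
        return None
    if row_step:
        return " + ".join(parts)
    return " * ".join(f"({p})" if " + " in p else p for p in parts)

def read_once(tables, g_c, g_t):
    """Try row decomposition at the top level first, then table decomposition."""
    result = comp_ro(tables, g_c, g_t, True)
    return result if result is not None else comp_ro(tables, g_c, g_t, False)

tables = {"R": {"w1", "w2", "w3"}, "S": {"v1", "v2", "v3", "v4"}, "T": {"u1", "u2", "u3"}}
g_t = {frozenset(("R", "S")), frozenset(("S", "T"))}
g_c = {frozenset(p) for p in [("w1", "v1"), ("w2", "v2"), ("v1", "u1"), ("v2", "u1"),
                              ("w3", "v3"), ("w3", "v4"), ("v3", "u2"), ("v4", "u3")]}
expr = read_once(tables, g_c, g_t)
```

On this input the sketch alternates row and table decompositions exactly as described above and returns the string `(w1 * v1 + w2 * v2) * u1 + w3 * (v3 * u2 + v4 * u3)`; on an instance where neither decomposition applies it returns `None`.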

5 Discussion of Time Complexity of Query-Answering Algorithm

Putting together our results from Sections 3 and 4, we propose the following algorithm for answering
boolean conjunctive queries without self-joins on tuple-independent probabilistic databases.

Phase 0 (Compute provenance DAG)
Input: query $Q$, event table rep $I$
Output: provenance DAG $H$
Complexity: $O((ne/k)^k)$

Phase 1 (Compute co-table graph)
Input: $H$, $Q$
Output: table-adjacency graph $G_T$, co-table graph $G_C$
Complexity: $O(n m_H + \beta_H m_{co} + k^2 \alpha \log \alpha)$ (Thm. 1)

Phase 2 (Compute read-once form)
Input: event table rep $I$, $Q$, $G_T$, $G_C$
Output: read-once form $f^*$ or FAIL
Complexity: $O(m_T \alpha \log \alpha + (m_C + n) \min(k, \sqrt{n}))$ (Thm. 2)

Size of the provenance DAG $H$. Let $f$ be the boolean event expression generated by some query
plan for $Q$ on the database $I$. The number of edges $m_H$ in the DAG $H$ represents the size of the
expression $f$. Since there are exactly $k$ subgoals in the input query $Q$, one for every table, every
prime implicant of $f_{IDNF}$ has exactly $k$ variables, so the size of $f_{IDNF}$ is at most $\binom{n}{k} \le (ne/k)^k$.
Further, the size of the input expression $f$ is maximum when $f$ is already in IDNF. So the size of
the DAG $H$ is upper bounded by $m_H \le (ne/k)^k$. Again, the "leaves" of the DAG $H$ are exactly the
$n$ variables in $f$. So $m_H \ge n - 1$, where the lower bound is achieved when the DAG $H$ is a tree
(every node in $H$ has a unique predecessor); in that case $f$ must be read-once and $H$ is the unique
read-once tree of $f$.

Therefore $n - 1 \le m_H \le (ne/k)^k$. Although the upper bound is quite high, it is our contention
that for practical query plans the size of the provenance graph is much smaller than the size of the
corresponding IDNF.

Data complexity. The complexity dichotomy of [8] is for data complexity, i.e., the size of the
query is bounded by a constant. This means that our $k$ and $\alpha$ are $O(1)$. Hence the time complexities
of Phase 1 and Phase 2 are $O(n m_H + \beta_H m_{co})$ and $O((m_C + n) \min(k, \sqrt{n}))$, respectively. As
discussed above, $m_H$ is $\Omega(n)$ and $O((ne/k)^k)$. So one of these two terms may dominate the other,
depending on the relative values of $m_H$, $m_C$, $m_{co}$ and $\beta_H$. For example, when $m_H = \Theta(n)$,
$m_{co} = \Theta(m_C) = \Theta(n^2)$, and $\beta_H = O(1)$, the first phase takes $O(n^2)$ time, whereas the second phase may
take $O(n^{5/2})$ time. However, when $m_H = \Omega(n^{3/2})$, the first phase always dominates.

In any case, we take the same position as [33] that for unsafe [8] queries the competition 
comes from the approach that does not try to detect whether the formulas are read-once and in- 
stead uses probabilistic inference E6l which is in general EXPTIME. In contrast, our algorithm 
runs in PTIME, and works for a larger class of queries than the safe queries flH) (but of course, not 
on all instances). 

Comparisons with other algorithms. For these comparisons we do not restrict ourselves to data
complexity; instead we take the various parameters of the problem into consideration.

First consider the general read-once detection algorithm. This consists of choosing some
plan for the query, computing the answer boolean event expression $f$, computing its IDNF, and
then using the (so far, best) algorithm [12] to check whether $f$ is read-once and, if so, to compute its
read-once form. The problem with this approach is that the read-once check is indeed done in time
a low polynomial, but in the size of $f_{IDNF}$. For example, consider a boolean query like the one
in Example 3. This query admits a plan (the safe plan!) that generates the event
expression $(x_1 + y_1) \cdots (x_n + y_n)$ on an instance in which each $R_i$ has two tuples $x_i$ and $y_i$. The
IDNF of this expression has $2^n$ prime implicants. It is nevertheless a
read-once expression easily detected by our algorithm, which avoids the computation of the IDNF.
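
To make the size gap concrete, the following Python sketch counts the prime implicants of $(x_1 + y_1) \cdots (x_n + y_n)$ by explicit expansion for small $n$. The function name and the variable naming scheme are assumptions of this illustration.

```python
from itertools import product

def idnf_size(n):
    """Number of prime implicants of (x1 + y1) ... (xn + yn), found by expansion."""
    clauses = [(f"x{i}", f"y{i}") for i in range(1, n + 1)]
    # Distributivity picks one variable from every clause.
    return len({frozenset(choice) for choice in product(*clauses)})

sizes = [idnf_size(n) for n in range(1, 6)]
```

The counts double with every additional clause, while the read-once form itself mentions only $2n$ variables.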

Next consider the cograph-help algorithm that we have already mentioned and justified in
Section 2. This consists of our Phase 0 and a slightly modified Phase 1 that computes the
co-occurrence graph $G_{co}$, followed by checking whether $G_{co}$ is a cograph using one of the linear-time
algorithms given in [6, 19, 3], which also output the read-once form if possible. Since Phase 0 and
Phase 1 are common, we only need to compare the last phases.

The cograph recognition algorithms all run in time $O(m_{co} + n)$. Our Phase 2 complexity is
better than this when $m_C \min(k, \sqrt{n}) = o(m_{co})$. Although in the worst case this algorithm performs
at least as well as our algorithm (since $m_C$ may be $\Theta(m_{co})$), (i) almost always the time required
in the first phases will dominate, so the asymptotic running times of both algorithms will be
comparable, (ii) as we have shown earlier, the ratio $m_{co}/m_C$ can be as large as $\Omega(n^2)$, and the benefit
of this could be significantly exploited by caching co-table graphs computed for other queries (see
the discussion in Section 7), and (iii) these linear-time algorithms use complicated data structures,
whereas we use simple graphs given as adjacency lists and connectivity-based algorithms, so our
algorithms are simpler to implement and may run faster in practice.

Finally we compare our algorithm against that given in [33]. Let us call it the lineage-tree
algorithm, since it takes the lineage tree of the result as input, as opposed to the provenance
DAG as we do. Although [33] does not give a complete running time analysis of the lineage-tree
algorithm, from the brief discussion available we can make, to the best of our understanding, the
following observations.

Every join node in the lineage tree has two children, and every project node can have an arbitrary
number of children. When the recursive algorithm computes the read-once trees of every child of
a project node, every pair of such read-once trees is merged, which may take $O(n^2 k^2)$ time for
every single merge (since the variables in the read-once trees to be merged are repeated). Without
counting the time to construct the lineage tree, this algorithm may take $O(N n^2 k^2)$ time in the worst
case, where $N$ is the number of nodes in the lineage tree.

Since [33] does not discuss constructing the lineage tree, we will also ignore our Phase 0. We
are left with comparing $N$ with $m_H$. It is easy to see that the number of edges in the provenance
DAG $H$ is $m_H = \Theta(N)$, where $N$ is the number of nodes in the lineage tree, when both originate from
the same query plan.$^{11}$ The lineage-tree algorithm takes $O(N n^2 k^2)$ time in the worst case, whereas
we use $O(n m_H + \beta_H m_{co} + k^2 \alpha \log \alpha) + O((m_C + n) \min(k, \sqrt{n})) = O(nN + \beta_H n^2 + k^2 \alpha \log \alpha + n^{5/2})$.
The width $\beta_H$ of the DAG $H$ can in the worst case be the number of nodes in $H$. So our
algorithm always gives an $O(k^2)$ improvement in time complexity over the lineage-tree algorithm
given in [33], and the benefit can often be larger.



6 Related Work 

The beautiful complexity dichotomy result of [8] classifying conjunctive queries without self-joins
on tuple-independent databases into "safe" and "unsafe" has spurred and intensified interest in 
probabilistic databases. Some papers have extended the class of safe relational queries [|9l l29ll30l 
13 • Others have addressed the question of efficient query answering for unsafe queries on some 
probabilistic databases. This includes mixing the intensional and extensional approaches, in effect 
finding subplans that yield read-once subexpressions in the event expressions [|23l . The technique 
identifies "offending" tuples that violate functional dependencies on which finding safe plans relies 
and deals with them intensionally. It is not clear that this approach would find the read-once forms 
that our algorithm finds. The OBDD-based approach in [|29ll works also for some unsafe queries on 
some databases. The SPROUT secondary-storage operator OTTl can handle efficiently some unsafe 
queries on databases satisfying certain functional dependencies. 

Exactly like us, [33] looks to decide efficiently when the extensional approach is applicable
given a conjunctive query without self-joins and a tuple-independent database. We have made
comparisons between the two papers in various places, especially in Section 5. Here we only add
that our algorithm deals with different query plans uniformly, while the lineage-tree algorithm
needs to do more work for non-deep plans. The graph structures used in our approach bear some
resemblance to the graph-based synopses for relational selectivity estimation in [34].

The read-once property has been studied for some time, albeit under various names [20] [17]
[35] EQ HIM 123. It was shown B2TJ that if RP $\neq$ NP then read-once cannot be checked in PTIME
for arbitrary monotone boolean formulas, but for formulas in IDNF (as input) read-once can be
checked in PTIME lfT2l. Our result here sheds new light on another class of formulas for which
such an efficient check can be done.

$^{11}$ If we "unfold" the provenance DAG $H$ to create the lineage tree, the tree will have exactly $m_H$ edges, and the number
of nodes in the tree will be $N = m_H + 1$.

7 Conclusions and Future Work 

We have investigated the problem of efficiently deciding when a conjunctive query without self- 
joins applied to a tuple-independent probabilistic database representation yields result representa- 
tions featuring read-once boolean event expressions (and, of course, efficiently computing their 
read-once forms when they exist). We have given a complete and simple to implement algorithm 
of low polynomial data complexity for this problem, and we have compared our results with those 
of other approaches. 

As explained in the introduction, the results of this paper do not constitute complexity di- 
chotomies. However, there is some hope that the novel proof of completeness that we give for our 
algorithm may be of help for complexity dichotomy results in the space coordinated by the type of 
queries and the type of databases we have studied. 

Of independent interest may be that we have also implicitly performed a study of an interesting
class of monotone boolean formulas, those that can be represented by the provenance graphs of
conjunctive queries without self-joins (characterizations of this class of formulas that do not
mention the query or the database can be easily given). We have shown that for this class of formulas
the read-once property is decidable in low PTIME (the problem for arbitrary formulas is unlikely
to be in PTIME, unless RP = NP). Along the way we have also given an efficient algorithm for
computing the co-occurrence graph of such formulas (in all the other papers we have examined,
computing the co-occurrence graph entails an excursion through computing a DNF; this, of course,
may be the best one can do for arbitrary formulas, if RP $\neq$ NP). It is likely that other nicely tractable
classes of boolean formulas occur in database applications and remain to be discovered.

For further work, one obvious direction is to extend our study to larger classes of queries and
probabilistic databases [9] [13]. Recall from the discussion in the introduction, however, that the
class of queries considered should not be able to generate arbitrary monotone boolean expressions.
Thus, SPJU queries are too much (but it seems that our approach might be immediately useful in
tackling unions of conjunctive queries without self-joins, provided the plans do the unions last).

On the more practical side, work needs to be done to apply our approach to non-boolean
queries, i.e., queries that return actual tables. Essentially, one would work with the provenance graph
associated with each table (initial, intermediate, and final), simultaneously computing the co-table
graphs of the event expressions on the graph's roots. It is likely that these co-table graphs can be
represented together, with ensuing economy.

However, we believe that the greatest practical impact would come from caching co-table graphs
at the level of the system, over batches of queries on the same database, since the more expensive
step in our algorithm is almost always the computation of the co-table graph (see the discussion in
Section 5).

This would work as follows, for a fixed database $I$. When a (let's say boolean, for simplicity)
conjunctive query $Q_1$ is processed, consider also the query $\bar{Q}_1$ which is obtained from $Q_1$ by
replacing each occurrence of a constant with a distinct fresh FO variable. Moreover, if an FO
variable $x$ occurs several times in a subgoal $R_i(\mathbf{x}_i)$ of $Q_1$ but does not occur in any of the other
subgoals (i.e., $x$ causes selections but not joins), replace also each occurrence of $x$ with a distinct
fresh FO variable. In other words, $Q_1$ is doing what $\bar{Q}_1$ is doing, but it first applies some selections
on the various tables of $I$. We can say that $\bar{Q}_1$ is the "join pattern" behind $Q_1$. Next, compute the
co-table graph for $\bar{Q}_1$ on $I$ and cache it together with $\bar{Q}_1$. It is not hard to see that the co-table
graph for $Q_1$ can be efficiently computed from that of $\bar{Q}_1$ by a "clean-up" of those parts related to
tuples of $I$ that do not satisfy the select conditions of $Q_1$.

When another query $Q_2$ is processed, check if its join-pattern $\bar{Q}_2$ matches any of the
join-patterns previously cached (if not, we further cache its join-pattern and co-table graph). Let's say
it matches $\bar{Q}_1$. Without defining precisely what "matches" means, its salient property is that the
co-table graph of $\bar{Q}_2$ can be efficiently obtained from that of $\bar{Q}_1$ by another clean-up, just of edges,
guided by the table-adjacency graph of $Q_2$ (which is the same as that of $\bar{Q}_2$). It can be shown that
these clean-up phases add only $O(n\alpha)$ to the running time.
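
The construction of a join pattern from a query can be sketched as follows. This is only an illustrative
sketch under assumptions of ours: a query is a list of subgoals `(table_name, terms)`, a term is either a
string (an FO variable) or a `Const` wrapper, and fresh variables are named `_v0`, `_v1`, ... (assumed not
to clash with variables already in the query). None of these names come from the paper.

```python
from collections import Counter
from itertools import count


class Const:
    """Hypothetical wrapper marking a term as a constant."""

    def __init__(self, value):
        self.value = value


def join_pattern(query):
    """Return the join pattern of `query`.

    query: list of (table_name, [terms]); a term is a str (FO variable) or a Const.
    """
    fresh = (f"_v{i}" for i in count())  # supply of fresh variable names

    # Count in how many distinct subgoals each variable occurs.
    subgoals_of = Counter()
    for _, terms in query:
        for v in {t for t in terms if isinstance(t, str)}:
            subgoals_of[v] += 1

    pattern = []
    for table, terms in query:
        occurrences = Counter(t for t in terms if isinstance(t, str))
        new_terms = []
        for t in terms:
            if isinstance(t, Const):
                # Each occurrence of a constant becomes a distinct fresh variable.
                new_terms.append(next(fresh))
            elif subgoals_of[t] == 1 and occurrences[t] > 1:
                # Variable repeated inside one subgoal only: a selection, not a join.
                new_terms.append(next(fresh))
            else:
                # Join variable (or a variable occurring once): keep it.
                new_terms.append(t)
        pattern.append((table, new_terms))
    return pattern


if __name__ == "__main__":
    q1 = [("R", ["x", "x", Const(5)]), ("S", ["y", "z"]), ("T", ["z"])]
    print(join_pattern(q1))
```

In the usage example, both occurrences of `x` and the constant in `R` are replaced by fresh variables, while
the join variable `z` shared by `S` and `T` is kept.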

There are two practical challenges in this approach. The first one is efficiently finding in the
cache some join-pattern that matches that of an incoming query. Storing the join-patterns together
in some clever data structure might help. The second one is storing largish numbers of cached
co-table graphs. Here we observe that they can all be stored with the same set of nodes, and each
edge would have a list of the co-table graphs in which it appears. Even these lists can be large; in
fact, the number of all possible join-patterns is exponential in the size of the schema. More ideas
are needed, and ultimately the viability of this caching technique can only be determined
experimentally.
A Generalization of Example 1



Example 5.

Answer event expression with standard plan:

$$x_1z_1u_1 + y_1z_2u_1 + \cdots + x_{2n}z_{4n-1}u_{2n} + x_{2n}z_{4n}v_{2n} \qquad (5)$$

Equivalent to:

$$(x_1z_1 + y_1z_2)u_1 + \cdots + x_{2n}(z_{4n-1}u_{2n} + z_{4n}v_{2n}) \qquad (6)$$

which is in read-once form and whose probability can be computed in time $O(n)$.
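
To illustrate why a read-once form admits linear-time probability computation, here is a minimal sketch.
It relies only on the standard facts that, for independent events, $P(a \cdot b) = P(a)P(b)$ and
$P(a + b) = 1 - (1 - P(a))(1 - P(b))$; in a read-once form the children of every node mention disjoint
sets of independent variables, so both rules apply at every node. The tree encoding (a variable name, or
a pair `(op, children)`) and the function name are our own assumptions.

```python
from math import prod


def read_once_probability(expr, p):
    """Probability of a read-once expression over independent variables.

    expr: a variable name (str), or a pair (op, children) with op in {"+", "*"}.
    p:    dict mapping each variable name to its probability.
    Each node is visited once, so the running time is linear in the tree size.
    """
    if isinstance(expr, str):
        return p[expr]
    op, children = expr
    probs = [read_once_probability(c, p) for c in children]
    if op == "*":
        # Conjunction of independent events.
        return prod(probs)
    if op == "+":
        # Disjunction of independent events.
        return 1.0 - prod(1.0 - q for q in probs)
    raise ValueError(f"unknown operator: {op!r}")


if __name__ == "__main__":
    # First summand of (6): (x1 z1 + y1 z2) u1, all probabilities 1/2.
    e = ("*", [("+", [("*", ["x1", "z1"]), ("*", ["y1", "z2"])]), "u1"])
    probs = {v: 0.5 for v in ["x1", "z1", "y1", "z2", "u1"]}
    print(read_once_probability(e, probs))  # 0.21875
```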

B PTIME Probability Computation for Non-Read-Once Expressions

The following example shows that there exists a query Q and a probabilistic database instance D 
such that the expression E for evaluation of query Q on database D is not read-once but still the 
probability of E being true can be computed in poly-time. 

Example 6. The database $D$ has three tables $R(A)$, $S(A,B)$, $T(B)$. Table $S$ has $n$ tuples. The tuple
in the $(2i-1)$-th row is $(a_i, b_i)$ and the tuple in the $2i$-th row is $(a_{i+1}, b_i)$. We suppose
that $S$ is a deterministic relation, i.e., all the tuples in $S$ belong to $S$ with probability one and are
annotated with true. If $n = 2k$ then table $R$ has $k+1$ tuples; if $n = 2k-1$ then $R$ has $k$ tuples. The
tuple in row $j$ of $R$ is $a_j$ and is annotated by $x_{2j-1}$. Table $T$ has $k$ tuples, where $n = 2k-1$ or
$n = 2k$: the tuple in row $j$ is $b_j$ and is annotated by $x_{2j}$. It can be verified that the expression $E_n$
annotating the answer to the query $Q() = R(A), S(A,B), T(B)$ is

$$E_n = x_1x_2 + x_2x_3 + \cdots + x_{n-1}x_n + x_nx_{n+1}.$$

An example with $n = 3$ is given in Figure 9. It can be easily verified that for all $n \geq 3$, the expression
$E_n$ is not read-once (whereas $E_1 = x_1x_2$ and $E_2 = x_2(x_1 + x_3)$ are).

Figure 9: Illustration with $n = 3$, $E_3 = x_1x_2 + x_2x_3 + x_3x_4$.



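Next we show that $P_n = P(E_n)$ can be computed in poly-time in $n$ by dynamic programming.
Note that $P_1$ can be computed in $O(1)$ time. Suppose for all $\ell < n$, $P_\ell$ is computed and stored in an
array (with the convention $P_\ell = 0$ for $\ell \leq 0$, since the empty sum is false). Then

$$
\begin{aligned}
P_n &= P(x_1x_2 + \cdots + x_{n-2}x_{n-1} + x_{n-1}x_n + x_nx_{n+1}) \\
    &= P(x_1x_2 + \cdots + x_{n-2}x_{n-1} + x_{n-1}x_n) + P(x_nx_{n+1}) \\
    &\quad - P(x_nx_{n+1}[x_1x_2 + \cdots + x_{n-2}x_{n-1} + x_{n-1}x_n]) \\
    &= P_{n-1} + P(x_nx_{n+1}) \qquad (7) \\
    &\quad - P(x_nx_{n+1}[x_1x_2 + \cdots + x_{n-2}x_{n-1} + x_{n-1}x_n]) \qquad (8)
\end{aligned}
$$

Observe that:

$$
\begin{aligned}
& P(x_nx_{n+1}[x_1x_2 + \cdots + x_{n-2}x_{n-1} + x_{n-1}x_n]) \\
&= P(x_{n+1})P(x_n[x_1x_2 + \cdots + x_{n-2}x_{n-1} + x_{n-1}x_n]) \\
&= P(x_{n+1})P(x_1x_2x_n + \cdots + x_{n-2}x_{n-1}x_n + x_{n-1}x_n) \qquad \text{(idempotency)} \\
&= P(x_{n+1})P(x_1x_2x_n + \cdots + x_{n-3}x_{n-2}x_n + x_{n-1}x_n) \qquad \text{(absorption)} \\
&= P(x_{n+1})P(x_n)P(x_1x_2 + \cdots + x_{n-3}x_{n-2} + x_{n-1}) \\
&= P(x_nx_{n+1})[P(x_1x_2 + \cdots + x_{n-3}x_{n-2}) + P(x_{n-1}) - P(x_1x_2 + \cdots + x_{n-3}x_{n-2})P(x_{n-1})] \\
&= P(x_nx_{n+1})[P_{n-3} + P(x_{n-1}) - P_{n-3}P(x_{n-1})] \qquad (9)
\end{aligned}
$$

The next-to-last step is inclusion-exclusion: the events $x_1x_2 + \cdots + x_{n-3}x_{n-2}$ and $x_{n-1}$ share no
variables and are therefore independent, but they are not disjoint, so the product term must be subtracted.

From (8) and (9),

$$
\begin{aligned}
P_n &= P_{n-1} + P(x_nx_{n+1}) - P(x_nx_{n+1})[P_{n-3} + P(x_{n-1}) - P_{n-3}P(x_{n-1})] \\
    &= P_{n-1} + P(x_nx_{n+1})[1 - P_{n-3}][1 - P(x_{n-1})]
\end{aligned}
$$

Since the variables $x_i$ are independent, $P(x_nx_{n+1}) = P(x_n)P(x_{n+1})$, and while computing $P_n$,
$P_{n-1}$ and $P_{n-3}$ are already available. Hence $P_n$ can be computed in linear time.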
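
The dynamic program can be sketched and checked against exhaustive enumeration as follows. This is an
illustrative sketch, not part of the original development. It uses the recurrence
$P_n = P_{n-1} + P(x_n)P(x_{n+1})\,[1 - P(E_{n-3} + x_{n-1})]$, where the independent events $E_{n-3}$ and
$x_{n-1}$ are combined by inclusion-exclusion, $P(E_{n-3} + x_{n-1}) = P_{n-3} + P(x_{n-1}) - P_{n-3}P(x_{n-1})$,
with the conventions $P_\ell = 0$ for $\ell \leq 0$ and $P(x_0) = 0$. The function names and the 1-based list
layout are our own choices.

```python
from itertools import product


def chain_probability(p):
    """P(E_n) for E_n = x1 x2 + x2 x3 + ... + xn x(n+1).

    p[i] is P(x_i) for i = 1..n+1; p[0] is unused padding.
    Runs in time linear in n.
    """
    n = len(p) - 2
    P = [0.0] * (n + 1)  # P[0] = 0: probability of the empty sum
    for m in range(1, n + 1):
        prev3 = P[m - 3] if m >= 3 else 0.0   # P_{m-3}, zero when undefined
        p_prev = p[m - 1] if m >= 2 else 0.0  # P(x_{m-1}), zero when x_0 is absent
        # P(E_{m-3} + x_{m-1}) for independent events (inclusion-exclusion).
        blocked = prev3 + p_prev - prev3 * p_prev
        P[m] = P[m - 1] + p[m] * p[m + 1] * (1.0 - blocked)
    return P[n]


def brute_force(p):
    """Same probability by summing over all 2^(n+1) assignments (for checking only)."""
    n = len(p) - 2
    total = 0.0
    for bits in product([0, 1], repeat=n + 1):
        x = (None,) + bits  # 1-based indexing, matching p
        if any(x[i] and x[i + 1] for i in range(1, n + 1)):
            weight = 1.0
            for i in range(1, n + 2):
                weight *= p[i] if x[i] else 1.0 - p[i]
            total += weight
    return total


if __name__ == "__main__":
    probs = [None, 0.3, 0.5, 0.7, 0.2, 0.9, 0.4]  # n = 5
    print(chain_probability(probs), brute_force(probs))
```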
C Proofs from Section 3



Proof of Lemma 4

Lemma 4. Algorithm CompCoTable adds an edge $(x,y)$ to $G_C$ if and only if $x, y$ together appear
in some implicant in $f_{IDNF}$ and the tables containing $x, y$ are adjacent in $G_T$.

Proof. Suppose two variables $x, y$ belong to the same implicant in $f_{IDNF}$, and their tables are
adjacent in $G_T$. Then by Lemma 2, there is a $\cdot$-node $u \in \mathrm{lca}(x,y)$, and $x \in \mathrm{Var}(v_1)$, $y \in \mathrm{Var}(v_2)$
for two distinct successors $v_1, v_2$ of $u$. When the algorithm processes the node $u$, if an edge between
$x, y$ has not been added in a previous step, the edge will be added. This shows the completeness of
algorithm CompCoTable.

Now we show the soundness of the algorithm. Consider two variables $x, y$ such that either
the tables containing them are not adjacent in $G_T$ or they do not belong together in any of the
implicants in $f_{IDNF}$. If the tables containing $x, y$ are not adjacent in $G_T$, clearly, the algorithm
never adds an edge between them, so let us consider the case when $x, y$ do not belong to the same
implicant in $f_{IDNF}$. Then by Lemma 2, there is no $\cdot$-node $u \in \mathrm{lca}(x,y)$.

Consider any iteration of the algorithm and suppose that a node $u$ is processed by the algorithm
in this iteration. If $u$ is a +-node or if either $x \notin \mathrm{Var}(u)$ or $y \notin \mathrm{Var}(u)$, again no edge is added
between $x, y$. So assume that $u$ is a $\cdot$-node and $x, y \in \mathrm{Var}(u)$. Then $u$ is a common ancestor of $x$
and $y$. But since $u \notin \mathrm{lca}(x,y)$, by the definition of the least common ancestor set, there is a successor $v$ of
$u$ such that $v$ is an ancestor of both $x, y$ and therefore $x, y \in \mathrm{Var}(v)$. However, by Corollary 3, since
$x$ or $y$ cannot belong to two distinct successors of node $u$, node $v$ must be the unique successor of
$u$ such that $x, y \in \mathrm{Var}(v)$. Since CompCoTable only joins variables from two distinct children,
no edge will be added between $x$ and $y$ in $G_C$. □



C.1 Time Complexity of CompCoTable

First we prove the following two lemmas bounding the number of times any given pair of variables
$x, y$ is considered by the algorithm. The first lemma shows that the variables $x, y$ are considered
by algorithm CompCoTable to add an edge between them in $G_{co}$ only when they together appear
in an implicant in $f_{IDNF}$, i.e., only if the edge actually should exist in $G_{co}$.

Lemma 8. Consider any two variables $x, y$ and a $\cdot$-node $u$. If $x, y$ do not appear together in
an implicant in $f_{IDNF}$, then $x, y$ do not belong to the variable sets $\mathrm{Var}(v_1)$, $\mathrm{Var}(v_2)$ for two distinct
successors $v_1, v_2$ of $u$.

Proof. This easily follows from Lemma 2, which says that if $x, y$ do not appear together in an
implicant in $f_{IDNF}$, then there is no $\cdot$-node in $\mathrm{lca}(x,y)$. So for every $\cdot$-node $u$, either (i) $x \notin \mathrm{Var}(u)$
or $y \notin \mathrm{Var}(u)$, or (ii) there is a unique successor $v$ of $u$ which is a common ancestor of $x, y$, i.e.,
$x, y \in \mathrm{Var}(v)$ (uniqueness follows from Corollary 3). □

The second lemma bounds the number of times a pair $x, y$ is considered by the algorithm to add
an edge between them.

The second lemma bounds the number of times a pair x,y is considered by the algorithm to add 
an edge between them. 
Lemma 9. Suppose $x, y \in \mathrm{Var}(f)$ are such that they together appear in an implicant in $f_{IDNF}$. Then
algorithm CompCoTable considers $x, y$ in Step 10 to add an edge between them at most $\beta_H$
times, where $\beta_H$ is the width of the provenance DAG $H$.



Proof. Note that the check in Step 10 is performed only when the current node $u$ is a $\cdot$-node.
Consider any $\cdot$-node $u$. (i) If either $x$ or $y$ is not in $\mathrm{Var}(u)$, clearly, $x, y$ are not checked in this
step; otherwise, (ii) if both $x, y \in \mathrm{Var}(u)$, and $x, y \in \mathrm{Var}(v)$ for a unique child $v$ of $u$, then also $x, y$
are not checked at this step; otherwise, (iii) if $u$ joins $x, y$, i.e., $x \in \mathrm{Var}(v_1)$, $y \in \mathrm{Var}(v_2)$ for two
distinct children $v_1, v_2$ of $u$, then and only then are $x, y$ considered by the algorithm in Step 10 (and after
this node $u$ is processed, both $x, y$ appear in $\mathrm{Var}(u)$).

However, since the query does not have any self-joins, the only time two variables $x, y$ appear
in two distinct successors of a $\cdot$-node $u$ is when the query plan joins a subset of tables containing the
table for $x$ with a subset of tables containing the table for $y$. So the pair $x, y$ is multiplied at a unique
layer of $H$, and the total number of times they are multiplied cannot exceed the total number of
nodes in the layer, which is at most the width $\beta_H$ of the DAG $H$. □

Now we complete the running time analysis of algorithm CompCoTable. 

Lemma 10. Given the table-adjacency graph $G_T$ and input query plan $H$, algorithm CompCoTable
can be implemented in $O(\beta_H m_{co} + nm_H)$ time, where $m_{co}$ is the number of edges
in the co-occurrence graph, $m_H$ is the number of edges in the DAG $H$, $\beta_H$ is the width of the DAG
$H$, and $n = |\mathrm{Var}(f)|$.

Proof. The initialization step can be done in $O(n)$ time. The topological sort can be done in
$O(m_H + |V(H)|)$ time by any standard algorithm.

At every node $u \in V(H)$, to compute the set $\mathrm{Var}(u)$, the algorithm scans the $O(d_u)$ successors of $u$,
where $d_u$ is the outdegree of node $u$ in $H$. Although by Corollary 3, for every two distinct children
$v_1, v_2$ of a $\cdot$-node $u$, $\mathrm{Var}(v_1) \cap \mathrm{Var}(v_2) = \emptyset$, they may have some overlap when $u$ is a +-node, and
here the algorithm incurs an $O(nm_H)$ cost in total as follows: (i) create an $n$-length boolean array for
$u$ initialized to all false, (ii) scan the $\mathrm{Var}(v)$ list of every successor $v$ of $u$; for a variable $x \in \mathrm{Var}(v)$,
if the entry for $x$ in the boolean array is false, mark it as true, (iii) finally scan the boolean array
again to collect the variables marked as true as the variables in $\mathrm{Var}(u)$. At every node $u \in V(H)$,
the algorithm spends $O(nd_u)$ time. Hence the total time
across all nodes is $\sum_{u \in V(H)} O(nd_u) = O(nm_H)$.

Every check in Step 10, i.e., whether an edge $(x,y)$ has already been added and whether the
tables containing $x, y$ are adjacent in $G_T$, can be done in $O(1)$ time using $O(n^2 + k^2) = O(n^2)$
space. Further, by Lemmas 8 and 9, the number of such checks performed is $O(\beta_H m_{co})$. Since
$\mathrm{Var}(f) \subseteq V(H)$ and $H$ is connected, $n \leq |V(H)| \leq |E(H)| + 1$. Hence the total time complexity is
$O(nm_H + \beta_H m_{co})$. □
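
The procedure analysed above can be sketched as follows. This is a simplified illustration reconstructed
from the statements of Lemmas 4 and 10, not the paper's listing of CompCoTable: the DAG encoding
(`node -> (kind, payload)` with `kind` one of `"leaf"`, `"+"`, `"*"`), the use of Python sets instead of
boolean arrays, and the memoised traversal in place of an explicit topological sort are all assumptions
of ours.

```python
def comp_co_table(dag, root, table_of, adjacent_tables):
    """Compute the edges of the co-table graph from a provenance DAG.

    dag:             node -> ("leaf", variable) | ("+", [children]) | ("*", [children])
    table_of:        variable -> name of the table containing its tuple
    adjacent_tables: set of frozenset({Ri, Rj}) for the edges of the table-adjacency graph
    Returns a set of frozenset({x, y}) edges.
    """
    var = {}       # memo: node -> frozenset of variables below the node
    edges = set()  # edges of the co-table graph found so far

    def visit(u):
        if u in var:  # each DAG node is processed once
            return var[u]
        kind, payload = dag[u]
        if kind == "leaf":
            var[u] = frozenset([payload])
            return var[u]
        child_vars = [visit(v) for v in payload]
        if kind == "*":
            # Join variables coming from two distinct children of a product node,
            # keeping only pairs whose tables are adjacent.
            for i in range(len(child_vars)):
                for j in range(i + 1, len(child_vars)):
                    for x in child_vars[i]:
                        for y in child_vars[j]:
                            if frozenset((table_of[x], table_of[y])) in adjacent_tables:
                                edges.add(frozenset((x, y)))
        var[u] = frozenset().union(*child_vars)
        return var[u]

    visit(root)
    return edges


if __name__ == "__main__":
    # Provenance DAG of (r1 s1 + r2 s2) t1 for the plan (R join S) join T.
    dag = {
        "r1": ("leaf", "r1"), "r2": ("leaf", "r2"),
        "s1": ("leaf", "s1"), "s2": ("leaf", "s2"), "t1": ("leaf", "t1"),
        "j1": ("*", ["r1", "s1"]), "j2": ("*", ["r2", "s2"]),
        "u": ("+", ["j1", "j2"]), "root": ("*", ["u", "t1"]),
    }
    table_of = {"r1": "R", "r2": "R", "s1": "S", "s2": "S", "t1": "T"}
    adjacent = {frozenset(("R", "S")), frozenset(("S", "T"))}
    print(sorted(sorted(e) for e in comp_co_table(dag, "root", table_of, adjacent)))
```

Dropping the adjacency test in the innermost loop yields the co-occurrence graph instead of the co-table
graph.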



We can now finish the proof of Theorem 1. As shown in Section 3.2, computation of the table-adjacency
graph $G_T$ takes $O(k^2\alpha\log\alpha)$ time, and this proves the second part of Theorem 1. The
time complexity analysis in Lemma 10 also holds when we modify CompCoTable to compute
the co-occurrence graph $G_{co}$ instead of the co-table graph $G_C$: the only change is that we do not
check whether the tables containing $x, y$ are adjacent in $G_T$. Further, we do not need to precompute
the graph $G_T$. This proves the first part and completes the proof of Theorem 1.
D Proofs from Section 4



The modified query in Algorithm 4 evaluates the same expression:

Lemma 11. Suppose $I = R_{i_1}[T'_{i_1}], \ldots, R_{i_p}[T'_{i_p}]$ is the set of input tables to the table decomposition
procedure TD and let $Q() :- R_{i_1}(\mathbf{x}_{i_1}), \ldots, R_{i_p}(\mathbf{x}_{i_p})$ be the input query. Then the expression $g$
generated by evaluating query $Q$ on $I$ is exactly the same as that generated by evaluating $\widehat{Q}$ on $I$, where
$\widehat{Q} = \widehat{Q}_1, \ldots, \widehat{Q}_\ell$ is the conjunction of the modified queries $\widehat{Q}_j$ returned by the procedure TD for
groups $j = 1$ to $\ell$.

Proof. We prove that a set of $p$ tuple variables taken from the $p$ tables satisfies the original input query
$Q$ if and only if it satisfies the modified query $\widehat{Q}$. Since the new query subgoals make some of
the original variables free, by replacing them with new variables, clearly, if a set of tuples satisfies
the original query it also satisfies the modified query. So we prove that the modified query does
not introduce any erroneous collection of tuples in the final answer.

Consider a set of tuple variables such that the corresponding tuples satisfy the modified query. Let
us partition these variables according to the $\ell$ groups of tables as computed by procedure TD.
Consider component $j$ of the partition and any table $R_i$ in component $j$. Recall that $C_i$ is the set of
all variables on the "+"-edges having one end at the table $R_i$ in component $j$. A "+" edge between
tables $R_i$ and $R_{i'}$ implies that the edges between every tuple in $R_i$ and every tuple in $R_{i'}$ exist, which in
turn implies that all tuples in $R_i$ and $R_{i'}$ must have the same values of the attributes corresponding
to $C_e = \mathbf{x}_i \cap \mathbf{x}_{i'}$. Then any set of $p$ tuples taken from the $p$ tables must have the same value of the attributes
corresponding to the variables in $C_e$. In other words, every variable $z \in C_i$ can be replaced by a new
free variable in every table $R_i$ in component $j$ (note that $C_i \subseteq \mathbf{x}_i$) without changing the final
solution. □

Proof of Lemma 5

Lemma 5. At any step of the recursion, if row decomposition is successful then table decomposition
is unsuccessful, and vice versa.

Proof. Consider any step of the recursive procedure, where the input tables are $R_{i_1}[T'_{i_1}], \ldots, R_{i_q}[T'_{i_q}]$
($\forall j$, $T'_{i_j} \subseteq T_{i_j}$), the input query is $Q'() :- R_{i_1}(\mathbf{x}_{i_1}), \ldots, R_{i_q}(\mathbf{x}_{i_q})$, and the induced subgraphs of $G_C$ and
$G_T$ on the current sets of tuples and tables are $G'_C$ and $G'_T$ respectively.

Suppose row decomposition is successful, i.e., it is possible to decompose the tuples in $G'_C$
into $\ell \geq 2$ connected components. Consider any two tables $R_i, R_j$ such that the edge $(R_i, R_j)$ exists
in $G'_T$, and consider their sub-tables $R_i[T_i^1]$ and $R_j[T_j^2]$ taken from two different connected
components in $G'_C$. Consider two arbitrary tuples $x \in T_i^1$ and $x' \in T_j^2$. Since $x$ and $x'$ belong to two
different connected components in $G'_C$, there is no edge $(x, x')$ in $G'_C$. Hence by Step 3 of the
table decomposition procedure, this edge $(R_i, R_j)$ will be marked by "-". Since $(R_i, R_j)$ was an
arbitrary edge in $G'_T$, all edges in $G'_T$ will be marked by "-" and there will be a unique component
in $G'_T$ using "-" edges. Therefore, the table decomposition procedure will be unsuccessful.

Now suppose table decomposition is successful, i.e., $G'_T$ can be decomposed into $\ell \geq 2$
components using "-" edges. Note that w.l.o.g. we can assume that the initial table-adjacency graph
$G_T$ is connected. Otherwise, we can run the algorithm on the different components of $G_T$ and
multiply the final expressions from the different components at the end. Since the procedure TD returns
induced subgraphs for every connected component, the input subgraph $G'_T$ is always a connected
graph at every step of the recursion. Now consider any two tables $R_i$ and $R_j$ from two different
groups such that the edge $(R_i, R_j)$ exists in $G'_T$ (such a pair must exist since the graph
$G'_T$ is connected). Since this edge is between two components of a successful table decomposition
procedure, it must be marked with "+". This implies that for any tuple $x \in R_i$ and any tuple $x' \in R_j$,
the edge $(x, x')$ exists in $G'_C$ (which follows from Step 3 of this procedure). This in turn implies
that row decomposition must fail at this step, since the tables $R_i, R_j$ cannot be decomposed into two
disjoint components and the graph $G'_C$ will be connected through these tuples. □

Proof of Lemma 6

Lemma 6. (Soundness) If the algorithm returns with success, then the expression $f^*$ returned by
the algorithm CompRO is equivalent to the expression $Q(I)$ generated by evaluation of query $Q$
on instance $I$. Further, the output expression $f^*$ is in read-once form.

Proof. We prove the lemma by induction on $n$, where $n = \sum_{i=1}^{k} |T_i|$. The base case is
$n = 1$. In this case there is only one tuple $x$, hence $k$ must be 1 as well, and therefore the algorithm
returns $x$ in Step 2. Here the algorithm trivially returns with success and outputs a read-once form.
The output is also correct, since the computation of the co-table graph ensures that there is no unused tuple
in the tables, and the unique tuple $x$ is the answer to query $Q$ on database $I$.

Suppose the induction hypothesis holds for all databases with at most $n - 1$ tuples and
consider a database with $n$ tuples. If $k = 1$, then irrespective of the query, all tuples in table $R_1$
satisfy the query $Q() :- R_1(\mathbf{x}_1)$ (again, there are no unused tuples), and therefore the algorithm
correctly returns $\sum_{x \in T_1} x$ as the answer, which is also in read-once form. So let us consider the case
when $k \geq 2$.

(1) Suppose the current recursive call successfully performs row decomposition, and $\ell \geq 2$
components $(R_1[T_1^1], \ldots, R_k[T_k^1]), \ldots, (R_1[T_1^\ell], \ldots, R_k[T_k^\ell])$ are returned. By the row
decomposition algorithm, it follows that for $x \in T_i^j$ and $x' \in T_{i'}^{j'}$ with $j \neq j'$, $x$ and $x'$ do not appear together in any
monomial in the DNF equivalent for $Q(I)$. So the tuples which row decomposition puts in different
components do not join with each other, and then the final answer of the query is the union of
the answers of the queries on the different components. Then the final expression is the sum of the
final expressions corresponding to the different components. Since all components have fewer than $n$ tuples
and the algorithm did not return with error in any of the recursive calls, by the inductive hypothesis
all the expressions returned by the recursive calls are the correct expressions and are in read-once
form. Moreover, these expressions clearly do not share variables: they correspond to tuples from
different tables, since the query does not have a self-join. We conclude that the final expression
computed by the algorithm is the correct one and is in read-once form.

(2) Otherwise, suppose the current step successfully performs table decomposition. Suppose $\ell \geq 2$
groups $\mathbf{R}_1, \ldots, \mathbf{R}_\ell$ are returned. Correctness of the table decomposition procedure, i.e., correctness
of the expression $f^* = f_1 \cdots f_\ell$ when all the recursive calls return successfully, follows from
Lemma 11 using the induction hypothesis (the algorithm multiplies the expressions returned by
different groups, which themselves are correct by the inductive hypothesis). Further, since all
components have fewer than $n$ tuples, and the algorithm did not return with error in any of the recursive
calls, all expressions returned by the recursive calls are in read-once form. Since they do not share
any common variable, the final output expression is also in read-once form. □

Proof of Lemma 7

Lemma 7. (Completeness) If the expression $Q(I)$ is read-once, then the algorithm CompRO
returns the unique read-once form $f^*$ of the expression.

Proof. Suppose the expression is read-once and consider the tree representation $T^*$ of the unique
read-once form $f^*$ of the expression ($T^*$ is in canonical form and has alternate levels of + and $\cdot$
nodes, which implies that every internal node in $T^*$ must have at least two children). We prove the lemma
by induction on the height $h$ of the tree $T^*$.

First consider the base case. If $h = 1$, then the tree must have a single node for a single tuple
variable $x$. Then $k$ must be 1 and the algorithm returns the correct answer. So consider $h \geq 2$.

(1) Consider the case when the root of the tree is a + node. If $h = 2$, since we do not allow the union
operation, $k$ must be 1 and all the tuples must belong to the same table $R_1$. This is taken care of by
Step 2 of CompRO. If $h > 2$, then $k$ must be $\geq 2$ and the answer to the join operation must be
non-empty. Every child of the root node corresponds to a set of monomials which will be generated by
the equivalent DNF expression $f_{IDNF}$ for the subtree rooted at that child. Note that no two variables
in two different children of the root node can belong to any monomial together, since the tree $T^*$ is
in read-once form. In other words, they do not share an edge in $G_C$. Hence the component formed
by the set of variables at a child will not have any edge to the set of variables at another child of
the root node. This shows that all variables at different children of the root node will belong to
different components by the row decomposition procedure.

Next we show that the variables in the subtree at each child of the root form a single connected
component, which shows that the row decomposition algorithm will
divide the tuples exactly the same way as the root of $T^*$ divides tuples among its children. Since $T^*$
is in canonical read-once form and has alternate levels of + and $\cdot$ nodes, row decomposition
cannot be done within the same subtree of a + node. So all variables in a subtree must form a
connected component. Since the root has $\geq 2$ children, in this case we will have a successful row
decomposition operation. By the inductive hypothesis, since the subtrees rooted at the children of the
root are all in read-once form, the recursive calls of the algorithm on the corresponding subtrees
are successful. Hence the overall algorithm at the top-most level will be successful.

(2) Now consider the case when the root of the tree is a $\cdot$ node. Note that the $\cdot$ operator can only
appear as a result of a join operation. If the root has $\ell' \geq 2$ children $c_1, \ldots, c_{\ell'}$, then every tuple $x$
in the subtree at $c_j$ joins with every tuple $y$ in the subtree at $c_{j'}$ for every pair $1 \leq j \neq j' \leq \ell'$.
Moreover, since the query does not have a self-join, $x, y$ must belong to two different tables, which
implies that there is an edge $(x, y)$ in $G_C$ between every pair of tuples $x, y$ from the subtrees at $c_j, c_{j'}$
respectively. Again, since we do not allow self-joins, and $T^*$ is in read-once form, the tuples in the
subtrees at $c_j, c_{j'}$ must belong to different tables if $j \neq j'$. In other words, the tables $R_1, \ldots, R_k$ are
partitioned into $\ell'$ disjoint groups $\mathbf{R}'_1, \ldots, \mathbf{R}'_{\ell'}$.

Next we argue that $\ell = \ell'$ and the partition $\mathbf{R}_1, \ldots, \mathbf{R}_\ell$ returned by the table decomposition procedure
is identical to $\mathbf{R}'_1, \ldots, \mathbf{R}'_{\ell'}$ up to a permutation of indices. Consider any pair $\mathbf{R}'_j$ and $\mathbf{R}'_{j'}$.
Since the tuple variables in these two groups are connected by a $\cdot$ operator, all tuples in all tables in
$\mathbf{R}'_j$ join with all tuples in all tables in $\mathbf{R}'_{j'}$. In other words, for any pair of tuples $x, x'$ from $R_{i_1} \in \mathbf{R}'_j$
and $R_{i_2} \in \mathbf{R}'_{j'}$, there is an edge $(x, x')$ in the co-occurrence graph $G_{co}$. Hence if there is a common
subset of join attributes between $R_{i_1}$ and $R_{i_2}$, i.e., the edge $(R_{i_1}, R_{i_2})$ exists in $G_T$, it will be marked
by a "+" (all possible edges between tuples will exist in the co-table graph $G_C$). So the table
decomposition procedure will put $\mathbf{R}'_j$ and $\mathbf{R}'_{j'}$ in two different components. This shows that $\ell \geq \ell'$.
However, since $T^*$ is in read-once form and has alternate levels of + and $\cdot$ nodes, no $\mathbf{R}'_j$ can be
decomposed further using the join operation (i.e., using "+" marked edges by the table decomposition
procedure); therefore, $\ell' = \ell$. Hence our table decomposition operation outputs exactly the groups
$\mathbf{R}'_1, \ldots, \mathbf{R}'_{\ell'}$. By the inductive hypothesis the algorithm returns with success in all recursive calls,
and since $\ell' = \ell \geq 2$, the table decomposition returns with success. So the algorithm returns with
success. □



D.1 Time Complexity of CompRO

Here we discuss the time complexity of algorithm CompRO in detail and show that algorithm
CompRO runs in time $O(m_T\alpha\log\alpha + (m_C + n)\min(k, \sqrt{n}))$. We divide the time complexity
computation in two parts: (i) the total time required to compute the modified queries across all table
decomposition steps performed by the algorithm (this will give $O(m_T\alpha\log\alpha)$ time) and (ii) the total time
required for all other steps: here we will ignore the time complexity of the modified query
computation step and will get a bound of $O((m_C + n)\min(k, \sqrt{n}))$. First we bound the time complexity of
individual row decomposition and table decomposition steps.

Lemma 12. The row decomposition procedure runs in time $O(m'_C + n')$,
where $n' = \sum_{j=1}^{q} |T'_{i_j}|$ is the total number of input tuples to the procedure, and $m'_C$ is the number of
edges in the induced subgraph of $G_C$ on these $n'$ tuples.

Proof. The row decomposition procedure only runs a connectivity algorithm like BFS/DFS to
compute the connected components. Then it collects and returns the tuples and computes the
induced subgraphs in these components. All this can be done in time linear in the size of the
input graph, which is $O(m'_C + n')$. □
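
The connectivity step of row decomposition can be sketched as below. This is a minimal illustration based
only on the description in the proof (BFS over the induced co-table graph, then grouping tuples by table);
the argument names and the adjacency-dict representation are assumptions, and `adj` is assumed to be already
restricted to the current tuples.

```python
from collections import defaultdict, deque


def row_decomposition(tuples, adj, table_of):
    """Split the current tuples into connected components of the co-table graph.

    tuples:   iterable of the current tuple variables
    adj:      tuple -> iterable of neighbouring tuples in the induced co-table graph
    table_of: tuple -> name of its table
    Returns a list of components, each a dict table -> list of tuples.
    The decomposition is successful when at least two components are returned.
    """
    seen = set()
    components = []
    for start in tuples:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = defaultdict(list)
        while queue:  # standard BFS, linear in the size of the graph
            x = queue.popleft()
            component[table_of[x]].append(x)
            for y in adj.get(x, ()):
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        components.append(dict(component))
    return components


if __name__ == "__main__":
    adj = {"r1": ["s1"], "s1": ["r1"], "r2": ["s2"], "s2": ["r2"]}
    table_of = {"r1": "R", "r2": "R", "s1": "S", "s2": "S"}
    print(row_decomposition(["r1", "s1", "r2", "s2"], adj, table_of))
```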

Next we show that the table decomposition can be executed in time $O(m'_C + n')$ as well.

Lemma 13. The table decomposition procedure as given in Algorithm 4 runs in time $O(m'_C + n')$,
ignoring the time required to compute the modified queries $\widehat{Q}_j$, where $n' = \sum_{j=1}^{q} |T'_{i_j}|$ is the total
number of input tuples to the procedure, and $m'_C$ is the number of edges in the induced subgraph
of $G_C$ on these $n'$ tuples.



Proof. Step 3 in the table decomposition procedure marks edges in $G'_T$ using $G'_C$. Let us assume
that $G'_C$ has been represented as a standard adjacency list. Consider a table $R_j[T'_j]$, where $T'_j \subseteq T_j$,
and let $d$ be the degree of $R_j$ in $G'_T$. Now a linear scan over the edges in $G'_C$ can partition the edges
$e$ from a tuple $x \in T'_j$ in table $R_j$ into $E_1, \ldots, E_d$, where $E_q$ ($q \in [1, d]$) contains all edges from $x$ to
tuples $x'$ belonging to the $q$-th neighbor of $R_j$. A second linear scan on these grouped adjacency
lists computed in the previous step is sufficient to mark every edge in $G'_T$ with a "+" or a "-":
for every neighbor $q$ of $R_j$, say $R_{j'}$, for every tuple $x$ in $T'_j$, scan the $q$-th group in the adjacency list to
check if $x$ has edges with all tuples in $R_{j'}[T'_{j'}]$. If yes, then all tuples in $R_{j'}$ also have edges to all
tuples in $R_j$, and the edge $(R_j, R_{j'})$ is marked with a "+". Otherwise, the edge is marked with a
"-". Hence the above two steps take $O(m'_C + n' + m'_T + k')$ time, where $k'$ and $m'_T$ are the number
of vertices (number of input tables) and edges in the subgraph $G'_T$.

Finally, returning the induced subgraphs of $G'_T$ for the connected components and decomposing
the tuples takes $O(m'_C + n' + m'_T + k')$ time. Since $n' \geq k'$ and $m'_C \geq m'_T$, not
considering the time needed to recompute the queries, the total time complexity is bounded by
$O(m'_C + n')$. □
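
The marking of table-adjacency edges and the grouping of tables can be sketched as follows. This is a
simplified illustration under assumptions of ours: it tests completeness of each table pair with a direct
subset check rather than with the two linear scans over grouped adjacency lists used in the proof, so it
does not attain the $O(m'_C + n')$ bound; all names are hypothetical, and the computation of the modified
queries is omitted.

```python
def mark_table_edges(tables, table_edges, adj):
    """Mark each table-adjacency edge "+" or "-".

    tables:      table name -> list of its current tuples
    table_edges: iterable of (Ri, Rj) edges of the table-adjacency graph
    adj:         tuple -> set of neighbouring tuples in the induced co-table graph
    An edge is "+" when every tuple of Ri is adjacent to every tuple of Rj.
    """
    marks = {}
    for ri, rj in table_edges:
        tj = set(tables[rj])
        complete = all(tj <= adj.get(x, set()) for x in tables[ri])
        marks[(ri, rj)] = "+" if complete else "-"
    return marks


def table_decomposition(tables, table_edges, adj):
    """Group the tables into connected components using only the "-" edges.

    Returns (marks, groups); the decomposition is successful when there are
    at least two groups.
    """
    marks = mark_table_edges(tables, table_edges, adj)
    neighbours = {t: [] for t in tables}
    for (ri, rj), mark in marks.items():
        if mark == "-":
            neighbours[ri].append(rj)
            neighbours[rj].append(ri)
    seen, groups = set(), []
    for t in tables:
        if t in seen:
            continue
        seen.add(t)
        stack, group = [t], []
        while stack:  # DFS over "-" edges
            u = stack.pop()
            group.append(u)
            for v in neighbours[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        groups.append(sorted(group))
    return marks, groups


if __name__ == "__main__":
    # Co-table graph of (r1 s1 + r2 s2) t1.
    tables = {"R": ["r1", "r2"], "S": ["s1", "s2"], "T": ["t1"]}
    adj = {"r1": {"s1"}, "r2": {"s2"}, "s1": {"r1", "t1"},
           "s2": {"r2", "t1"}, "t1": {"s1", "s2"}}
    print(table_decomposition(tables, [("R", "S"), ("S", "T")], adj))
```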

The next lemma bounds the total time required to compute the modified queries over all calls 
to the recursive algorithm. 

Lemma 14. The modified queries Qj over all steps can be computed in time 0(mja\og a), where 
a is the maximum size of a subgoal. 

Proof. We use a simple charging argument. For an edge $e = (R_i, R_j)$ in $G_T$, the common
variable set $C_e = x_i \cap x_j$ (footnote 12) can be computed by (i) first sorting the variables
in $x_i$ and $x_j$ in some fixed order, and then (ii) doing a linear scan over these sorted lists to
collect the common variables. Hence this step takes $O(\alpha \log \alpha)$ time. Alternatively,
we can store the variables of $x_i$ in a hash table and then, with a single scan of the variables in
$x_j$, compute the common attribute set $C_e$ in $O(\alpha)$ expected time. Once $C_e$ has been
computed in a fixed sorted order for every edge $e$ from $R_i$ to a different component, the lists
$C_e$ can be repeatedly merged to compute the variable set $C_i = \bigcup_e C_e$ in
$O(d_i \alpha)$ time (note that even after merging any number of $C_e$ sets, the length of each
list is bounded by the subgoal size of $R_i$, which is at most $\alpha$). However, instead of
accounting for the total time $O(d_i \alpha)$ at the node $R_i$ in $G_T$, we charge every such edge
$e = (R_i, R_j)$ in $G_T$ an amount of $O(\alpha)$ for this merging procedure. So every edge $e$
from $R_i$ to an $R_j$ in a different component gets a charge of $O(\alpha \log \alpha)$.

Suppose we charge the outgoing edges $(R_i, R_j)$ from $R_i$ to different components a fixed cost
$P$, with $P = O(\alpha)$, in the above process. From the table decomposition procedure it follows
that the common join attributes are computed, and the query is updated, only when the edge
$(R_i, R_j)$ belongs to the cut between two connected components formed by the "-" edges. These
edges are then deleted by the table decomposition procedure: all subsequent recursive calls
consider only the edges inside these connected components, and the edges between two connected
components are never considered again. So each edge in the graph $G_T$ is charged at most once for
the computation of common join attributes, which gives $O(m_T \alpha \log \alpha)$ as the total
time required for this process.

Finally, the variables in $x_i$ can also be replaced by new variables using the sorted list for
$C_i$ in $O(\alpha)$ time, so the total time needed is
$O(m_T \alpha \log \alpha + n\alpha) = O(m_T \alpha \log \alpha)$ (since we assumed, without loss
of generality, that $G_T$ for the query $Q$ is connected). □
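
The two ways of computing the common variable set $C_e$ described in the proof can be sketched as below. The function names `common_vars_sorted` and `common_vars_hashed` and the example variable lists are hypothetical; subgoal vectors are modeled as plain Python lists of variable names, treated as sets as in footnote 12.

```python
def common_vars_sorted(xi, xj):
    """Sort both variable lists, then intersect them with one linear scan."""
    a, b = sorted(set(xi)), sorted(set(xj))
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common  # sorted, so several such lists can later be merged


def common_vars_hashed(xi, xj):
    """Hash the variables of xi, then scan xj once (expected linear time)."""
    seen = set(xi)
    # dict.fromkeys removes duplicates from xj while keeping its order.
    return [v for v in dict.fromkeys(xj) if v in seen]


# Assumed example subgoals R_i(x, y, z) and R_j(y, w, x).
c_sorted = common_vars_sorted(["x", "y", "z"], ["y", "w", "x"])
c_hashed = common_vars_hashed(["x", "y", "z"], ["y", "w", "x"])
```

The sorted variant returns its result in the fixed order that the merging step relies on; the hashed variant returns the same set of variables in the order in which they occur in the second subgoal.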

12 We abuse notation and consider the sets corresponding to the vectors $x_i, x_j$ to compute $C_e$.

Now we show that the depth of the recursion tree is $O(\min(k, \sqrt{n}))$ and that in every level
of the tree the total time required is at most $O(m_C + n)$. Consider the recursion tree of the
algorithm CompRO and assume, wlog., that the top-most level performs a row decomposition. Since the
size of the table-adjacency subgraph $G'_T$ is always dominated by the co-table subgraph $G'_C$ at
any recursive call of the algorithm, we express the time complexity of the algorithm with $k$
tables, $n$ tuples and $m$ edges in the subgraph $G'_C$ as $T_1(n,m,k)$, where the top-most
operation is a row decomposition. Further, every component after a row decomposition has exactly
$k$ tables and therefore must have at least $k$ tuples, because we assumed, wlog., that the initial
table-adjacency graph $G_T$ is connected and that there are no unused tuples in the tables.
Similarly, $T_2(n,m,k)$ denotes the time complexity when the top-most operation is a table
decomposition. Note that at every row decomposition step, every tuple and every edge in $G'_C$ goes
to exactly one of the recursive calls of the algorithm; however, the number of tables $k$ remains
unchanged. On the other hand, for a table decomposition, every tuple goes to exactly one recursive
call, every edge goes to at most one such call (edges between connected components are discarded),
and every table goes to exactly one call. Recall that row and table decompositions alternate at
every step, and that the time required for both steps is $O(m + n)$ (not counting the computation
of modified queries at every table decomposition step), so we have the following recursive
formulas for $T_1(n,m,k)$ and $T_2(n,m,k)$.

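$$T_1(n,m,k) = O(m+n) + \sum_{j=1}^{\ell} T_2(n_j, m_j, k),
\quad \text{where } \sum_{j=1}^{\ell} n_j = n,\ \sum_{j=1}^{\ell} m_j = m,\ n_j \geq k \ \forall j$$

$$T_2(n,m,k) = O(m+n) + \sum_{j=1}^{\ell} T_1(n_j, m_j, k_j),
\quad \text{where } \sum_{j=1}^{\ell} n_j = n,\ \sum_{j=1}^{\ell} m_j \leq m,\ \sum_{j=1}^{\ell} k_j = k$$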

where $n_j$, $m_j$ and $k_j$ are, respectively, the number of tuples, the number of edges in
$G'_C$, and the number of tables for the $j$-th recursive call (for row decomposition, $k_j = k$).
For the base case, we have $T_2(n_j, m_j, 1) = O(n_j)$: for $k = 1$, $O(n_j)$ time is needed to
compute the read-once form; also, in this case $m_j = 0$ (a row decomposition cannot be a leaf in
the recursion tree for a successful completion of the algorithm). Moreover, it is important to
note that for a successful row or table decomposition, $\ell \geq 2$.

If we draw the recursion tree for $T_1(n, m_C, k)$ (assuming the top-most operation is a row
decomposition), at every level of the tree we pay a cost of at most $O(m_C + n)$. This is because
the tuples and edges go to at most one of the recursive calls, and $k$ does not play a role at any
node of the recursion tree (it is absorbed by the term $O(m_C + n)$).
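
The per-level accounting can be illustrated on a small hand-made recursion tree. The sketch below is an assumption-laden toy: the encoding of a call as a triple `(n, m, children)`, the function name `level_costs` and the numbers are all hypothetical. The example models a query with two tables whose co-table graph consists of two components, each joining one tuple of the first table with two tuples of the second.

```python
def level_costs(tree):
    """Return, for each depth of the recursion tree, the sum of n + m over
    all calls at that depth. A call is a triple (n, m, children)."""
    costs = []
    frontier = [tree]
    while frontier:
        costs.append(sum(n + m for n, m, _ in frontier))
        frontier = [child for _, _, children in frontier for child in children]
    return costs


# Root: row decomposition of 6 tuples and 4 edges into two components.
# Each component: table decomposition into single tables (k = 1, no edges).
component = (3, 2, [(1, 0, []), (2, 0, [])])
tree = (6, 4, [component, component])
costs = level_costs(tree)
```

Because tuples and edges are never duplicated across sibling calls, no level costs more than the root in this example, which is the property the argument above relies on.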

Now we give a bound on the height of the recursion tree. 

Lemma 15. The height of the recursion tree is upper bounded by $O(\min(k, \sqrt{n}))$.
Proof. Every internal node has at least two children and there are at most $k$ leaves (we return
from a path in the recursion tree when $k$ becomes 1). Therefore, there are $O(k)$ nodes in the
tree and the height of the tree is bounded by $O(k)$ (note that both the number of nodes and the
height may be $\Theta(k)$ when the tree is not balanced).

Next we show that the height of the recursion tree is also bounded by $4\sqrt{n}$. The recursion
tree has alternating layers of table and row decompositions. We focus only on the table
decomposition layers; the height of the tree is at most twice the number of these layers. Now
consider an arbitrary path $P$ in the recursion tree from the root to a leaf, where the number of
table decompositions on $P$ is $h$. Suppose that in the calls $T_2(n,m,k)$, the values of $n$ and
$k$ along this path (for the table decomposition layers) are
$(n_0,k_0), (n_1,k_1), \ldots, (n_h,k_h)$, where $k_0 = k$ and $n_0 \leq n$ (if the top-most level
has a table decomposition operation, then $n_0 = n$). We show that $h \leq 2\sqrt{n}$.

Assume, for contradiction, that $h > 2\sqrt{n}$, and consider the first $p = 2\sqrt{n}$ levels
along the path $P$. If at some $j$-th layer, $j \in [1,p]$, we have $k_j \leq 2\sqrt{n} - j$, then
the number of table decomposition steps along $P$ is at most $2\sqrt{n}$: every node in the
recursion tree has at least two children, so the value of $k$ decreases by at least 1 at each such
step. The number of table-decomposition layers after the $j$-th node is therefore at most $k_j$,
and the number of table-decomposition layers before the $j$-th node is exactly $j$. Hence the
total number of table-decomposition layers is at most $2\sqrt{n}$.

Otherwise, $k_j > 2\sqrt{n} - j$ for all $j \in [1,p]$. Note that $n_j \leq n_{j-1} - k_{j-1}$:
there is a row decomposition step between two table decompositions, and every component in the
$j$-th row decomposition step has at least $k_j$ nodes. In this case we show that $n_p < 0$.
Indeed,

$$
\begin{aligned}
n_p &\leq n_{p-1} - k_{p-1} \\
    &\leq n_{p-2} - k_{p-2} - k_{p-1} \\
    &\;\;\vdots \\
    &\leq n_0 - \sum_{j=0}^{p-1} k_j \\
    &\leq n - \sum_{j=0}^{p-1} k_j \\
    &< n - \sum_{j=1}^{2\sqrt{n}-1} j \\
    &= n - \frac{2\sqrt{n}\,(2\sqrt{n}-1)}{2} \\
    &= n - 2n + \sqrt{n} \\
    &\leq 0,
\end{aligned}
$$

where the strict inequality uses $k_0 \geq 1$ and $k_j > 2\sqrt{n} - j$ for $j \in [1, p-1]$ with
$p = 2\sqrt{n}$. This is a contradiction, since $n_p$ is the number of nodes at a recursive call
and cannot be negative. This shows that along any path from the root to a leaf, the number of table
decomposition layers is bounded by $2\sqrt{n}$, which in turn shows that the height of the tree is
bounded by $4\sqrt{n}$. □
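
The $2\sqrt{n}$ bound on the number of table-decomposition layers can be sanity-checked numerically in a simplified model. The sketch below is not part of the proof and rests on assumptions: it keeps only three constraints used in the argument ($k$ drops by at least 1 per table-decomposition layer, $n_j \leq n_{j-1} - k_{j-1}$, and a call with $k_j$ tables needs at least $k_j$ tuples), and it greedily picks the smallest admissible values $k_j$. The function name `max_table_layers` is hypothetical.

```python
import math


def max_table_layers(n):
    """Largest h such that values k_0 > k_1 > ... > k_h >= 1 and
    n_j <= n_{j-1} - k_{j-1} still leave n_h >= k_h, starting from n_0 = n.
    Uses the cheapest choice k_j = h - j + 1 (simplified model)."""
    h = 0
    while True:
        # Try to extend the path to h + 1 layers: k_0..k_{h+1} = h+2, ..., 1.
        ks = list(range(h + 2, 0, -1))
        n_last = n - sum(ks[:-1])  # tuples left at the last layer
        if n_last < ks[-1]:
            return h
        h += 1


# Check the bound of the lemma for a range of small n (assumed test range).
bound_holds = all(max_table_layers(n) <= 2 * math.sqrt(n) for n in range(1, 501))
```

In this model the longest path grows roughly like $\sqrt{2n}$, which stays below $2\sqrt{n}$.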

Since the total time needed at every level of the recursion tree is $O(m_C + n)$, we have the
following corollary.

Corollary 1. Not considering the time needed to compute the modified queries in the table
decomposition procedure, the algorithm CompRO runs in time $O((m_C + n)\min(k, \sqrt{n}))$.

The above corollary together with Lemma [14] (which says that $O(m_T \alpha \log \alpha)$ time
suffices to compute the modified queries) shows that CompRO runs in time
$O(m_T \alpha \log \alpha + (m_C + n)\min(k, \sqrt{n}))$